Executive Summary

Background: This report explores the application of clustering analysis to identify distinct groups within a dataset related to diabetes risk factors. The dataset includes variables such as age, glucose levels, insulin levels, BMI, blood pressure, and genetic predisposition factors.

Objective: The objective was to categorize individuals into clusters based on their health profiles and assess their respective risks of developing diabetes. By understanding these clusters, targeted interventions could be developed to mitigate diabetes risks effectively.

Results: Four distinct clusters were identified:

- Cluster 1: Young adults with low diabetes risk.
- Cluster 2: Middle-aged individuals with high diabetes risk.
- Cluster 3: Young adults with moderate diabetes risk.
- Cluster 4: Older adults with moderate diabetes risk.

Each cluster exhibited unique characteristics in terms of age, glucose levels, insulin levels, BMI, blood pressure, and genetic predisposition. Strategic interventions were recommended for each cluster to optimize diabetes prevention and management efforts.

Conclusion: By tailoring interventions to the specific needs of each cluster, healthcare providers can enhance the effectiveness of diabetes prevention strategies. This targeted approach not only improves health outcomes but also contributes to reducing the overall burden of diabetes in the population.

Exploratory Data Analysis

In this section, an exploratory data analysis (EDA) was conducted on the diabetes dataset. The primary objective was to understand the distribution of each variable, identify missing values, and explore potential relationships between features. This analysis served as a foundation for subsequent modeling and predictive analysis.

Data Overview

The dataset was loaded and the first few rows were displayed to get an initial glimpse of the data structure. A summary of the dataset was generated to gain insights into the central tendency and dispersion of each feature. Missing values were checked in the dataset, as they can significantly impact the analysis and modeling. The number of missing values in each column was calculated and displayed.

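# Load the packages used throughout this report
library(tidyverse)  # readr, dplyr, tidyr, ggplot2
library(gridExtra)  # grid.arrange()
library(GGally)     # ggpairs(), wrap()

# Load the data
data <- read_csv("diabetes.csv")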
## Rows: 768 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, D...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the dataset
print(head(data))
## # A tibble: 6 × 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
##         <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>
## 1           6     148            72            35       0  33.6
## 2           1      85            66            29       0  26.6
## 3           8     183            64             0       0  23.3
## 4           1      89            66            23      94  28.1
## 5           0     137            40            35     168  43.1
## 6           5     116            74             0       0  25.6
## # ℹ 3 more variables: DiabetesPedigreeFunction <dbl>, Age <dbl>, Outcome <dbl>
# Summary of the dataset
print(summary(data))
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
# Check for missing values
missing_values <- sapply(data, function(x) sum(is.na(x)))
print(missing_values)
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

Distribution of Features

Histograms were constructed for each feature to visualise their distributions. These visualisations highlight how data points are distributed across different ranges, offering insights into the prevalence and spread of each variable.

Histograms

# Helper function to create percentage histograms
create_histogram <- function(data, column, title, binwidth, fill_color) {
  # .data[[column]] looks the column up by its string name (replaces the deprecated aes_string())
  ggplot(data, aes(x = .data[[column]])) +
    # after_stat(count) replaces the deprecated ..count.. notation
    geom_histogram(aes(y = after_stat(100 * count / sum(count))), binwidth = binwidth, colour = "black", fill = fill_color) +
    ggtitle(title) +
    xlab(column) +
    ylab("Percentage") +
    theme_minimal() +
    theme(plot.title = element_text(size = 14, face = "bold"),
          axis.title = element_text(size = 12),
          axis.text = element_text(size = 8))
}

# Create histograms for each feature with new colors
p1 <- create_histogram(data, "Pregnancies", "Number of Pregnancies", 1, "#1f77b4") # Blue
p2 <- create_histogram(data, "Glucose", "Glucose", 5, "#ff7f0e") # Orange
p3 <- create_histogram(data, "BloodPressure", "Blood Pressure", 2, "#2ca02c") # Green
p4 <- create_histogram(data, "SkinThickness", "Skin Thickness", 2, "#d62728") # Red
p5 <- create_histogram(data, "Insulin", "Insulin", 20, "#9467bd") # Purple
p6 <- create_histogram(data, "BMI", "Body Mass Index", 1, "#8c564b") # Brown
p7 <- create_histogram(data, "DiabetesPedigreeFunction", "Diabetes Pedigree Function", 0.05, "#e377c2") # Pink
p8 <- create_histogram(data, "Age", "Age", 1, "#7f7f7f") # Gray

# Arrange plots in a grid layout with larger size
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 2)

Histogram Analysis

Number of Pregnancies

The histogram shows that most individuals in the dataset have had 0 to 5 pregnancies (median 3). Progressively fewer individuals appear at higher counts, up to the maximum of 17.

Glucose

The distribution of glucose levels is approximately normal, with most values between 80 and 150 and a median of 117. A small number of records have a glucose value of 0, which is not physiologically plausible and is handled during preprocessing.

Blood Pressure

The histogram for blood pressure shows most readings clustered around 70 to 80 (median 72), with a roughly symmetric spread around that range. A separate group of 35 zero readings, which are not physiologically plausible, pulls the mean (69.11) below the median.

Skin Thickness

Skin thickness distribution is right-skewed, with most non-zero measurements between 20 and 40. The 227 zero entries form a separate spike at the left edge of the histogram and are treated as missing values during preprocessing.

Insulin

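The insulin distribution is strongly right-skewed. Nearly half of the records (374 of 768) have an insulin value of 0, the median is 30.5 while the mean is 79.8, and the non-zero values concentrate around 100 with a long tail reaching 846.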

Diabetes Pedigree Function

The histogram for the diabetes pedigree function reveals a right-skewed distribution, with the majority of individuals having function values less than 1. This metric provides insights into the genetic predisposition to diabetes within the study population.

BMI (Body Mass Index)

BMI distribution appears roughly normal, with most values between 20 and 40 and a median of 32. Because a BMI of 30 or above is conventionally classified as obese, a typical individual in this dataset is overweight or obese rather than in the healthy weight range.

Age

The age distribution is right-skewed, indicating that a significant number of individuals are younger, with ages clustering around 20 to 40. Understanding age demographics is crucial for analysing health outcomes across different age groups.

Summary

The histograms describe the distribution and central tendency of the key health metrics in the diabetes dataset. These findings serve as a foundation for the further exploration and modeling that follow.
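
The skew directions read off the histograms can be cross-checked numerically. The following is a minimal sketch of one way to do that, using a moment-based skewness coefficient and a mean-versus-median comparison. The helper name `sample_skewness` and the toy vectors are illustrative assumptions and are not part of the original analysis.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment.
# Positive values indicate a longer right tail, negative a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / (mean(centred^2)^1.5)
}

# Hypothetical toy vectors, for illustration only
right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 20)
symmetric    <- c(1, 2, 3, 4, 5)

skew_right <- sample_skewness(right_tailed)
skew_sym   <- sample_skewness(symmetric)

# A quick second check: mean above median usually accompanies right skew
mean_minus_median <- mean(right_tailed) - median(right_tailed)

print(c(skew_right = skew_right, skew_sym = skew_sym))
```

Applied to the report's data, a call such as `sapply(data[sapply(data, is.numeric)], sample_skewness)` would give one coefficient per feature, which can then be compared against the visual impressions above.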

Relationships Between Features

Density plots were created to explore the distribution of each feature, segmented by diabetes outcome. These plots help identify potential patterns or differences in feature distributions between individuals with and without diabetes. Furthermore, scatter plots were used to visualise the relationship between pairs of features, with data points color-coded based on the outcome variable (diabetes presence). This helped in identifying any patterns or trends that existed between these features.

Density Plot

# Helper function to create density plots with outcome comparison
create_density_with_outcome <- function(data, column, title) {
  # Mean of the chosen column within each outcome group
  mean_values <- data %>%
    group_by(Outcome) %>%
    summarize(mean_value = mean(.data[[column]], na.rm = TRUE)) %>%
    ungroup()
  
  # .data[[column]] replaces the deprecated aes_string()
  ggplot(data, aes(x = .data[[column]], fill = as.factor(Outcome))) +
    geom_density(alpha = 0.5) +
    # linewidth replaces the deprecated size aesthetic for lines
    geom_vline(data = mean_values, aes(xintercept = mean_value, color = as.factor(Outcome)),
               linetype = "dotted", linewidth = 1) +
    scale_fill_manual(values = c("#FFFF00", "#008080")) +  # Yellow and Teal hex codes
    scale_color_manual(values = c("red", "blue")) +
    labs(title = title, x = column, fill = "Outcome", color = "Outcome") +
    theme_minimal() +
    theme(plot.title = element_text(size = 10, face = "bold"),
          axis.title = element_text(size = 10),
          axis.text = element_text(size = 8))
}

# Create density plots for each feature with outcome comparison
p1 <- create_density_with_outcome(data, "Pregnancies", "Pregnancies vs Diabetes")
p2 <- create_density_with_outcome(data, "Glucose", "Glucose vs Diabetes")
p3 <- create_density_with_outcome(data, "BloodPressure", "Blood Pressure vs Diabetes")
p4 <- create_density_with_outcome(data, "SkinThickness", "Skin Thickness vs Diabetes")
p5 <- create_density_with_outcome(data, "Insulin", "Insulin vs Diabetes")
p6 <- create_density_with_outcome(data, "BMI", "BMI vs Diabetes")
p7 <- create_density_with_outcome(data, "DiabetesPedigreeFunction", "Diabetes Pedigree Function vs Diabetes")
p8 <- create_density_with_outcome(data, "Age", "Age vs Diabetes")

# Arrange plots in a grid layout with larger size
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 2)

Density Plot Analysis

Pregnancies vs. Diabetes Outcome

The density plots show that individuals with diabetes tend to have a slightly higher mean number of pregnancies compared to those without diabetes. This suggests a possible correlation between higher pregnancy numbers and diabetes risk.

Glucose vs. Diabetes Outcome

The distribution of glucose levels shows that individuals with diabetes have significantly higher mean glucose levels than those without diabetes. This strong differentiation highlights glucose as a critical factor in diabetes diagnosis and management.

Blood Pressure vs. Diabetes Outcome

Blood pressure distributions reveal a subtle difference in mean values between individuals with and without diabetes. While there is a slight variation, it suggests that blood pressure alone may not be a strong differentiator for diabetes in this population.

Skin Thickness vs. Diabetes Outcome

The skin thickness density plots indicate little difference between the two groups, suggesting that this metric does not strongly distinguish between diabetes and non-diabetes individuals.

Insulin vs. Diabetes Outcome

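The insulin level distributions show that individuals with diabetes tend to have slightly higher mean insulin levels. This is consistent with the known association between hyperinsulinaemia and diabetes, although the many zero insulin entries in the raw data mean the comparison should be read with caution.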

BMI (Body Mass Index) vs. Diabetes Outcome

BMI is higher on average for individuals with diabetes. This correlation aligns with known associations between higher body mass index and increased diabetes risk.

Diabetes Pedigree Function vs. Diabetes Outcome

The diabetes pedigree function values are slightly higher for individuals with diabetes, indicating a possible genetic predisposition in these cases.

Age vs. Diabetes Outcome

The diabetes group has a higher mean age than the non-diabetes group. This suggests that age is an important factor in the prevalence of diabetes, with older individuals being more at risk.

Scatter Plot

# Function to create scatter plot with outcome comparison
create_scatter_with_outcome <- function(data, x_col, y_col, x_title, y_title) {
  # .data[[...]] looks columns up by string name (replaces the deprecated aes_string())
  ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]], color = as.factor(Outcome))) +
    geom_point(alpha = 0.7) +
    labs(x = x_title, y = y_title, color = "Outcome") +
    theme_minimal() +
    theme(plot.title = element_text(size = 14, face = "bold"),
          axis.title = element_text(size = 12),
          axis.text = element_text(size = 10))
}

# Create scatter plot for a couple of variables
scatter_plot1 <- create_scatter_with_outcome(data, "Glucose", "BMI", "Glucose", "BMI")
scatter_plot2 <- create_scatter_with_outcome(data, "Age", "BloodPressure", "Age", "Blood Pressure")

# Arrange plots in a grid layout
grid.arrange(scatter_plot1, scatter_plot2, ncol = 2)

Scatter Plot Analysis

Glucose vs. BMI

The scatter plot of Glucose levels against BMI (Body Mass Index) shows how these two variables relate. Glucose levels range from 0 to 199, while BMI values span from 0 to 67.1; the zeros in both variables are implausible readings rather than real measurements.

In examining the scatter plot, a noticeable positive correlation between Glucose levels and BMI is evident. Higher Glucose levels generally correspond with higher BMI values. This trend is particularly observable among individuals with diabetes, who typically exhibit both elevated Glucose and BMI levels compared to those without diabetes. However, the distinction between the two groups in this plot is not very pronounced, suggesting that additional factors might also play significant roles in differentiating between the outcomes of diabetes and non-diabetes.

Age vs. Blood Pressure

The scatter plot of Age versus Blood Pressure examines the interaction between these two variables. Age ranges from 21 to 81 years, and Blood Pressure values extend from 0 to 122.

Observations from this scatter plot indicate no clear linear relationship between Age and Blood Pressure. The data points are widely scattered, highlighting a substantial variability in Blood Pressure across different ages. This variability suggests that Blood Pressure is influenced by multiple factors beyond Age alone. Despite the lack of a strong linear relationship, a slight trend of increasing Blood Pressure with Age is discernible. This trend aligns with the general medical understanding that Blood Pressure tends to rise as individuals age.

These scatter plots visually depict the complex relationships between these health metrics and their association with diabetes.

Pairwise Relationships

Pairwise scatter plots were utilized to examine relationships between pairs of numeric variables. Each plot included data points colored by diabetes outcome, facilitating visual identification of correlations or trends between variables.

Pairwise Scatter Plot

# Convert Outcome to factor with appropriate levels
data$Outcome <- factor(data$Outcome, levels = c(0, 1))

# Select only numeric columns (excluding "Outcome")
numeric_data <- data[, sapply(data, is.numeric) & !(names(data) %in% "Outcome")]

# Define a custom color palette for Outcome
my_colors <- c("#1f77b4", "#ff7f0e")  # Blue and Orange

# Create a custom wrap function for points to include color
wrap_points <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.5, ...) +
    scale_color_manual(values = my_colors)
}

# Plot pairwise scatter plots using GGally with custom aesthetics
ggpairs(data,
        columns = which(sapply(data, is.numeric) & !(names(data) %in% "Outcome")),
        mapping = ggplot2::aes(color = Outcome),
        lower = list(continuous = wrap_points),
        upper = list(continuous = wrap("cor", size = 3)),
        diag = list(continuous = wrap("barDiag", binwidth = 1)),
        title = "Pairwise Scatter Plots of Numeric Variables"
)

Pairwise Scatter Plot Analysis

This analysis explores the relationships between different numerical variables related to diabetes through scatter plots. Each plot highlights the interaction between two variables, providing insights into potential correlations and patterns.

Pregnancies vs. Other Variables

The scatter plots show that the number of pregnancies has a weak positive correlation with Glucose levels. However, no strong patterns or significant correlations emerge with other variables, indicating that the number of pregnancies may not be a strong predictor for other health metrics in this dataset.

Glucose vs. Other Variables

Glucose levels exhibit a positive correlation with BMI (Body Mass Index) and Insulin, suggesting that individuals with higher glucose levels tend to have higher BMI and Insulin values. This correlation aligns with the understanding that elevated glucose levels, hyperinsulinaemia and increased body weight are often linked. However, no clear patterns are observed between glucose levels and other variables.

Blood Pressure vs. Other Variables

Blood pressure does not show strong correlations with other variables in the dataset. The scatter plots reveal a wide dispersion of blood pressure values across different levels of other variables, indicating a lack of significant linear relationships.

Skin Thickness vs. Other Variables

Skin thickness does not demonstrate significant correlations with other variables. The scatter plots suggest that skin thickness is relatively independent of other health metrics in this dataset, showing no strong linear patterns.

Insulin vs. Other Variables

Insulin levels show a positive correlation with glucose levels and a weak positive correlation with BMI, suggesting that individuals with higher insulin levels may also have higher glucose and BMI values. No strong patterns are observed between insulin levels and the remaining variables, indicating that insulin is not a strong predictor of those health metrics in this dataset.

BMI vs. Other Variables

BMI shows a positive correlation with glucose levels, reinforcing the link between increased body weight and higher glucose levels. However, BMI does not exhibit clear patterns with other variables, suggesting that while BMI and glucose levels are related, BMI alone is not strongly predictive of other health metrics.

Diabetes Pedigree Function vs. Other Variables

The diabetes pedigree function does not show strong correlations with other variables. The scatter plots indicate that this metric, which reflects genetic predisposition to diabetes, operates independently of the other health metrics in this dataset.

Age vs. Other Variables

Age exhibits a weak positive correlation with blood pressure, aligning with the understanding that blood pressure tends to increase with age. However, no strong patterns are observed between age and other variables, indicating that age alone is not a strong predictor of other health metrics in this dataset.

Summary

These scatter plots provide a visual overview of the relationships between various health metrics related to diabetes. While some correlations and patterns are observable, many variables do not show strong linear relationships, highlighting the complexity of predicting diabetes and related health outcomes based on these metrics alone.

Central Tendency Using Box Plots

Box plots were generated to illustrate the distribution of numerical variables across different diabetes outcomes. Each plot depicts the spread and central tendency of variables within each outcome category, offering insights into potential differences between groups.

# Define custom colors
my_colors <- c("#FFFF66", "#66CCCC", "#FF9966")  # Yellow, Teal, Peach

# Reshape data for ggplot
data_long <- data %>%
  gather(key = "variable", value = "value", -Outcome)

# Create box plots using ggplot
ggplot(data_long, aes(x = Outcome, y = value, fill = as.factor(Outcome))) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Box Plots of Numerical Variables by Outcome") +
  theme_minimal() +
  scale_fill_manual(values = my_colors) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels if needed

Box Plot Analysis

The box plots provide a comparative visualisation of numerical variables based on the diabetes outcome categories (labeled as ‘0’ for no diabetes and ‘1’ for diabetes). These plots help in understanding the distribution and spread of each variable within the two outcome groups.

Age

The median age is higher for individuals with diabetes (outcome ‘1’). The age distribution for this group is also wider, indicating greater variability in age among diabetic individuals than among those without diabetes (outcome ‘0’).

Blood Pressure

The median blood pressure is slightly higher for individuals with diabetes (outcome ‘1’). The variability in blood pressure is similar for both groups.

BMI (Body Mass Index)

The median BMI is higher for the diabetes group (outcome ‘1’), indicating that individuals with diabetes tend to have a higher body mass index. Additionally, the range of BMI values is wider for this group, reflecting greater variability in body weight among diabetic individuals.

Diabetes Pedigree Function

The median value for the diabetes pedigree function is slightly higher for individuals with diabetes (outcome ‘1’). This function, which represents genetic predisposition, shows more variability among those with diabetes, suggesting diverse genetic factors at play.

Glucose

The median glucose level is markedly higher for the diabetes group (outcome ‘1’). The two outcome groups overlap less for glucose than for the other variables, indicating that higher glucose levels are strongly associated with diabetes.

Insulin

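Individuals with diabetes (outcome ‘1’) have a lower median insulin level, yet the upper part of their distribution extends higher, so high insulin values are more common among diabetic individuals. The low median may partly reflect the many zero insulin entries (374 of 768 records), which are treated as missing during preprocessing.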

Pregnancies

The median number of pregnancies is higher for individuals with diabetes (outcome ‘1’). The range of pregnancy counts is also wider for this group, indicating greater variability in the number of pregnancies among diabetic individuals.

Skin Thickness

The median skin thickness is similar for both outcome groups. However, the variability in skin thickness is greater for individuals with diabetes (outcome ‘1’), suggesting more diverse skin thickness measurements among this group.

Summary

These box plots highlight the differences in distribution and variability of key health metrics between individuals with and without diabetes. They provide valuable insights into how these variables are associated with the presence of diabetes, helping to identify potential risk factors and areas for further investigation.
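
The medians and spreads read off the box plots can also be tabulated directly. The sketch below shows one way to do this with base R's `aggregate()`. The `toy` data frame is a hypothetical stand-in for the dataset: its Glucose and BMI values are taken from rows printed in this report, but the Outcome labels are assumed for illustration only.

```r
# Hypothetical toy data; the Outcome labels are assumed, not taken from the dataset
toy <- data.frame(
  Outcome = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  Glucose = c(85, 89, 116, 110, 148, 183, 137, 168),
  BMI     = c(26.6, 28.1, 25.6, 37.6, 33.6, 23.3, 43.1, 38.0)
)

# Median of every numeric column within each outcome group
# (the centre line of each box)
group_medians <- aggregate(. ~ Outcome, data = toy, FUN = median)

# Interquartile range within each group (the height of each box)
group_iqr <- aggregate(. ~ Outcome, data = toy, FUN = IQR)

print(group_medians)
print(group_iqr)
```

Running the same two calls on the full dataset would give a numeric counterpart to each panel of the box plot figure.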

Statistical Relationships

A correlation matrix was computed and visualised to quantify the strength and direction of relationships between numeric variables. This analysis provides insights into variables that may influence each other and helps prioritise features for further investigation.
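
The code that produced the correlation plot is not shown in this report. As a rough sketch of how such a matrix can be computed, the example below uses base R's `cor()` on a small toy data frame built from the first six rows printed in the Data Overview. The choice of Pearson correlation and the ranking step are assumptions made for illustration, not a reconstruction of the original code.

```r
# Toy numeric data: the first six rows shown in the Data Overview output
toy <- data.frame(
  Pregnancies   = c(6, 1, 8, 1, 0, 5),
  Glucose       = c(148, 85, 183, 89, 137, 116),
  BloodPressure = c(72, 66, 64, 66, 40, 74),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

# Pearson correlation matrix; pairwise deletion tolerates NAs if any are present
cor_matrix <- cor(toy, method = "pearson", use = "pairwise.complete.obs")
print(round(cor_matrix, 2))

# Rank each variable pair by absolute correlation (upper triangle only,
# so every pair appears once)
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs_ranked <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
print(pairs_ranked)
```

On the full dataset the same calls could be applied to `numeric_data`, and the resulting matrix passed to a plotting function of choice.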

Correlation Plot Analysis

The correlation plot offers a comprehensive view of how various health-related variables interrelate, providing insights into their mutual influences. This analysis is crucial for understanding which factors might influence others and how they collectively contribute to health outcomes.

Age

Age shows a weak positive correlation with glucose levels, suggesting a tendency for glucose levels to increase slightly with age. This observation shows the importance of age as a factor in understanding metabolic health changes over time. Furthermore, age demonstrates a significant correlation with the number of pregnancies, indicating that older individuals tend to have had more pregnancies throughout their lives.

Blood Pressure

In this plot, blood pressure exhibits weak correlations with BMI and age. This finding suggests that changes in blood pressure are not strongly influenced by variations in BMI or age within this dataset. However, it highlights the need for further investigation into other potential factors that may impact blood pressure variability.

BMI (Body Mass Index)

BMI shows a positive correlation with glucose levels, indicating that individuals with higher BMI tend to have higher glucose levels. This association affirms the link between obesity and metabolic health, where higher BMI can contribute to increased glucose levels. There were no strong correlations observed between BMI and other variables in this analysis.

Diabetes Pedigree Function

The diabetes pedigree function does not show significant correlations with any other variables in the plot. This result suggests that genetic predisposition to diabetes, as measured by the pedigree function, operates independently of the other health metrics included in this study. This finding emphasises the complex nature of diabetes susceptibility, involving both genetic and environmental factors.

Glucose

Glucose levels demonstrate a positive correlation with insulin levels. This relationship indicates that as glucose levels rise, insulin levels tend to increase as well, reflecting the body’s response to maintain glucose homeostasis. Weak correlations were also observed between glucose levels and both BMI and age, suggesting minor associations with these variables.

Insulin

Apart from its positive correlation with glucose, insulin shows only a weak correlation with skin thickness. This suggests that insulin levels are associated with skin thickness to a small degree. Understanding these relationships can provide insights into insulin regulation and its role in metabolic health.

Pregnancies

The number of pregnancies exhibits a strong positive correlation with age. This correlation highlights a natural life course relationship, where older individuals tend to have had more pregnancies. This observation is relevant for understanding reproductive health impacts and potential implications for metabolic health.

Skin Thickness

Skin thickness shows weak correlations with insulin and BMI. This finding suggests limited associations between skin thickness and these health metrics within the dataset. Further exploration may reveal additional insights into the physiological implications of skin thickness in relation to metabolic health.

Summary

The correlation plot analysis provides an understanding of how various health-related variables interact and influence each other.

Preprocessing

Preprocessing the data is crucial to ensure that it is clean, consistent, and ready for further analysis. This section outlines the steps taken to prepare the dataset for modeling and analysis.

Checking and Handling Zero Values

Zero values in certain variables can sometimes indicate missing data or outliers. It is essential to identify and appropriately handle these values to avoid bias in subsequent analyses.

# Check for zero values in numeric_data
zero_counts <- sapply(numeric_data, function(x) sum(x == 0))

# Print the results
print(zero_counts)
##              Pregnancies                  Glucose            BloodPressure 
##                      111                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                      227                      374                       11 
## DiabetesPedigreeFunction                      Age 
##                        0                        0

Imputation of Missing Values

Columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI often contain zero values that are not plausible for these health-related metrics. These zeros are replaced with NA (Not Available) to signify missing data, enabling more accurate imputation.
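
The replacement step is not shown in this report, so the following is a minimal sketch of one way to do it in base R. The `toy` data frame reuses the first six rows printed in the Data Overview; the vector name `zero_as_missing` is an illustrative assumption.

```r
# First six rows from the Data Overview output, used as a small worked example
toy <- data.frame(
  Pregnancies   = c(6, 1, 8, 1, 0, 5),
  Glucose       = c(148, 85, 183, 89, 137, 116),
  BloodPressure = c(72, 66, 64, 66, 40, 74),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

# Columns where a zero cannot be a real measurement
zero_as_missing <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Replace zeros with NA in those columns only;
# Pregnancies keeps its zeros because zero pregnancies is a valid value
toy[zero_as_missing] <- lapply(toy[zero_as_missing], function(x) replace(x, x == 0, NA))

# Count the resulting missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Restricting the replacement to a named list of columns is what keeps the 111 zero values in Pregnancies from being treated as missing.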

KNN Imputation

The K-nearest neighbors (KNN) imputation method is employed using the VIM package. This technique fills in missing values based on the values of neighboring data points, ensuring that imputed values are realistic and contextually appropriate for health-related metrics.
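
The imputation call itself is not shown in this report. The sketch below illustrates how `kNN()` from the VIM package can be used on a small data frame with missing values. The toy data are randomly generated and hypothetical, and `k = 5` is the package default rather than a value confirmed by the report.

```r
library(VIM)  # provides kNN()

# Hypothetical toy data with a few missing values (illustrative values only)
set.seed(1)
toy <- data.frame(
  Glucose = round(rnorm(40, mean = 120, sd = 30)),
  BMI     = round(rnorm(40, mean = 32, sd = 6), 1),
  Insulin = round(rnorm(40, mean = 150, sd = 40))
)
toy$Insulin[c(3, 11, 20)] <- NA
toy$BMI[7] <- NA

# Impute the two incomplete columns from the 5 nearest neighbours.
# With the default imp_var = TRUE, kNN() also appends logical indicator
# columns named <variable>_imp that flag which values were imputed.
imputed <- kNN(toy, variable = c("Insulin", "BMI"), k = 5)

print(names(imputed))
print(summary(imputed[c("Insulin", "BMI")]))
```

The `variable` argument limits imputation to the columns that actually contain missing values, while the remaining columns still contribute to the distance calculation.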

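Removing Imputation Indicator Columns

After imputation, columns suffixed with _imp are removed from the dataset. These are the logical indicator columns that the VIM package adds to flag which values were imputed; the imputed values themselves are stored in the original columns. This cleanup step ensures that only the original variables, now with their imputed values, remain for further analysis, reducing redundancy and keeping the dataset structure clear.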
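
A minimal sketch of this cleanup in base R follows. The small `imputed` data frame is a hypothetical stand-in for the imputation output: its Glucose and Insulin values match the first three rows printed in this report, where the original insulin readings were 0, and the indicator columns are written by hand for illustration.

```r
# Hypothetical stand-in for the imputation output, including indicator columns
imputed <- data.frame(
  Glucose     = c(148, 85, 183),
  Insulin     = c(175, 55, 325),
  Glucose_imp = c(FALSE, FALSE, FALSE),
  Insulin_imp = c(TRUE, TRUE, TRUE)
)

# Optionally record how many values were imputed per variable
# before the flags are dropped
imputed_counts <- colSums(imputed[grepl("_imp$", names(imputed))])
print(imputed_counts)

# Keep only the columns whose names do not end in "_imp"
clean <- imputed[, !grepl("_imp$", names(imputed)), drop = FALSE]
print(names(clean))
```

Anchoring the pattern with `$` avoids dropping any column that merely contains "_imp" somewhere in the middle of its name.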

Conversion of Outcome Variable

The “Outcome” variable, which denotes the presence (1) or absence (0) of diabetes, is converted to a factor. This conversion allows for categorical analysis and ensures that the model interprets this variable correctly during predictive modeling and statistical analyses.

##     Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1             6     148            72            35     175 33.6
## 2             1      85            66            29      55 26.6
## 3             8     183            64            28     325 23.3
## 4             1      89            66            23      94 28.1
## 5             0     137            40            35     168 43.1
## 6             5     116            74            27     112 25.6
## 7             3      78            50            32      88 31.0
## 8            10     115            68            39     122 35.3
## 9             2     197            70            45     543 30.5
## 10            8     125            96            36     150 32.7
## 11            4     110            92            38     105 37.6
## 12           10     168            74            32     171 38.0
## 13           10     139            80            19     135 27.1
## 14            1     189            60            23     846 30.1
## 15            5     166            72            19     175 25.8
## 16            7     100            72            29     130 30.0
## 17            0     118            84            47     230 45.8
## 18            7     107            74            30     115 29.6
## 19            1     103            30            38      83 43.3
## 20            1     115            70            30      96 34.6
## 21            3     126            88            41     235 39.3
## 22            8      99            84            26     105 35.4
## 23            7     196            90            32     280 39.8
## 24            9     119            80            35     130 29.0
## 25           11     143            94            33     146 36.6
## 26           10     125            70            26     115 31.1
## 27            7     147            76            33     304 39.4
## 28            1      97            66            15     140 23.2
## 29           13     145            82            19     110 22.2
## 30            5     117            92            38      75 34.1
## 31            5     109            75            26     105 36.0
## 32            3     158            76            36     245 31.6
## 33            3      88            58            11      54 24.8
## 34            6      92            92            27      51 19.9
## 35           10     122            78            31     100 27.6
## 36            4     103            60            33     192 24.0
## 37           11     138            76            33     122 33.2
## 38            9     102            76            37     175 32.9
## 39            2      90            68            42     129 38.2
## 40            4     111            72            47     207 37.1
## 41            3     180            64            25      70 34.0
## 42            7     133            84            39     235 40.2
## 43            7     106            92            18     135 22.7
## 44            9     171           110            24     240 45.4
## 45            7     159            64            31     193 27.4
## 46            0     180            66            39     465 42.0
## 47            1     146            56            25     128 29.7
## 48            2      71            70            27      50 28.0
## 49            7     103            66            32     130 39.1
## 50            7     105            88            33      89 34.2
## 51            1     103            80            11      82 19.4
## 52            1     101            50            15      36 24.2
## 53            5      88            66            21      23 24.4
## 54            8     176            90            34     300 33.7
## 55            7     150            66            42     342 34.7
## 56            1      73            50            10      54 23.0
## 57            7     187            68            39     304 37.7
## 58            0     100            88            60     110 46.8
## 59            0     146            82            40     272 40.5
## 60            0     105            64            41     142 41.5
## 61            2      84            78            32      89 37.2
## 62            8     133            72            35     125 32.9
## 63            5      44            62            29      54 25.0
## 64            2     141            58            34     128 25.4
## 65            7     114            66            29     156 32.8
## 66            5      99            74            27     100 29.0
## 67            0     109            88            30     110 32.5
## 68            2     109            92            30     190 42.7
## 69            1      95            66            13      38 19.6
## 70            4     146            85            27     100 28.9
## 71            2     100            66            20      90 32.9
## 72            5     139            64            35     140 28.6
## 73           13     126            90            36     150 43.4
## 74            4     129            86            20     270 35.1
## 75            1      79            75            30      50 32.0
## 76            1     107            48            20     100 24.7
## 77            7      62            78            30      71 32.6
## 78            5      95            72            33      75 37.7
## 79            0     131            76            40     230 43.2
## 80            2     112            66            22      94 25.0
## 81            3     113            44            13      86 22.4
## 82            2      74            78            32      89 32.0
## 83            7      83            78            26      71 29.3
## 84            0     101            65            28      94 24.6
## 85            5     137           108            36     220 48.8
## 86            2     110            74            29     125 32.4
## 87           13     106            72            54     105 36.6
## 88            2     100            68            25      71 38.5
## 89           15     136            70            32     110 37.1
## 90            1     107            68            19     110 26.5
## 91            1      80            55            15      76 19.1
## 92            4     123            80            15     176 32.0
## 93            7      81            78            40      48 46.7
## 94            4     134            72            27     175 23.8
## 95            2     142            82            18      64 24.7
## 96            6     144            72            27     228 33.9
## 97            2      92            62            28      87 31.6
## 98            1      71            48            18      76 20.4
## 99            6      93            50            30      64 28.7
## 100           1     122            90            51     220 49.7
## 101           1     163            72            38     185 39.0
## 102           1     151            60            25     168 26.1
## 103           0     125            96            22     110 22.5
## 104           1      81            72            18      40 26.6
## 105           2      85            65            32      77 39.6
## 106           1     126            56            29     152 28.7
## 107           1      96           122            19      55 22.4
## 108           4     144            58            28     140 29.5
## 109           3      83            58            31      18 34.3
## 110           0      95            85            25      36 37.4
## 111           3     171            72            33     135 33.3
## 112           8     155            62            26     495 34.0
## 113           1      89            76            34      37 31.2
## 114           4      76            62            32      71 34.0
## 115           7     160            54            32     175 30.5
## 116           4     146            92            31     285 31.2
## 117           5     124            74            30     115 34.0
## 118           5      78            48            32      71 33.7
## 119           4      97            60            23      49 28.2
## 120           4      99            76            15      51 23.2
## 121           0     162            76            56     100 53.2
## 122           6     111            64            39      94 34.2
## 123           2     107            74            30     100 33.6
## 124           5     132            80            26     135 26.8
## 125           0     113            76            35      96 33.3
## 126           1      88            30            42      99 55.0
## 127           3     120            70            30     135 42.9
## 128           1     118            58            36      94 33.3
## 129           1     117            88            24     145 34.5
## 130           0     105            84            29     180 27.9
## 131           4     173            70            14     168 29.7
## 132           9     122            56            31     171 33.3
## 133           3     170            64            37     225 34.5
## 134           8      84            74            31      71 38.3
## 135           2      96            68            13      49 21.1
## 136           2     125            60            20     140 33.8
## 137           0     100            70            26      50 30.8
## 138           0      93            60            25      92 28.7
## 139           0     129            80            26     205 31.2
## 140           5     105            72            29     325 36.9
## 141           3     128            78            26     112 21.1
## 142           5     106            82            30      75 39.5
## 143           2     108            52            26      63 32.5
## 144          10     108            66            36     130 32.4
## 145           4     154            62            31     284 32.8
## 146           0     102            75            23      89 28.7
## 147           9      57            80            37      49 32.8
## 148           2     106            64            35     119 30.5
## 149           5     147            78            27     168 33.7
## 150           2      90            70            17      53 27.3
## 151           1     136            74            50     204 37.4
## 152           4     114            65            24      74 21.9
## 153           9     156            86            28     155 34.3
## 154           1     153            82            42     485 40.6
## 155           8     188            78            32     280 47.9
## 156           7     152            88            44     210 50.0
## 157           2      99            52            15      94 24.6
## 158           1     109            56            21     135 25.2
## 159           2      88            74            19      53 29.0
## 160          17     163            72            41     114 40.9
## 161           4     151            90            38     140 29.7
## 162           7     102            74            40     105 37.2
## 163           0     114            80            34     285 44.2
## 164           2     100            64            23      87 29.7
## 165           0     131            88            30     145 31.6
## 166           6     104            74            18     156 29.9
## 167           3     148            66            25     284 32.5
## 168           4     120            68            32     152 29.6
## 169           4     110            66            32      88 31.9
## 170           3     111            90            12      78 28.4
## 171           6     102            82            32     160 30.8
## 172           6     134            70            23     130 35.4
## 173           2      87            64            23      50 28.9
## 174           1      79            60            42      48 43.5
## 175           2      75            64            24      55 29.7
## 176           8     179            72            42     130 32.7
## 177           6      85            78            30      71 31.2
## 178           0     129           110            46     130 67.1
## 179           5     143            78            39     108 45.0
## 180           5     130            82            32     110 39.1
## 181           6      87            80            27     100 23.2
## 182           0     119            64            18      92 34.9
## 183           1      89            74            20      23 27.7
## 184           5      73            60            23      49 26.8
## 185           4     141            74            36     126 27.6
## 186           7     194            68            28     280 35.9
## 187           8     181            68            36     495 30.1
## 188           1     128            98            41      58 32.0
## 189           8     109            76            39     114 27.9
## 190           5     139            80            35     160 31.6
## 191           3     111            62            22      86 22.6
## 192           9     123            70            44      94 33.1
## 193           7     159            66            33     325 30.4
## 194          11     135            90            37     150 52.3
## 195           8      85            55            20      54 24.4
## 196           5     158            84            41     210 39.4
## 197           1     105            58            20     100 24.3
## 198           3     107            62            13      48 22.9
## 199           4     109            64            44      99 34.8
## 200           4     148            60            27     318 30.9
## 201           0     113            80            16      82 31.0
## 202           1     138            82            37     160 40.1
## 203           0     108            68            20      73 27.3
## 204           2      99            70            16      44 20.4
## 205           6     103            72            32     190 37.7
## 206           5     111            72            28     110 23.9
## 207           8     196            76            29     280 37.5
## 208           5     162           104            32     231 37.7
## 209           1      96            64            27      87 33.2
## 210           7     184            84            33     277 35.5
## 211           2      81            60            22      49 27.7
## 212           0     147            85            54     255 42.8
## 213           7     179            95            31     168 34.2
## 214           0     140            65            26     130 42.6
## 215           9     112            82            32     175 34.2
## 216          12     151            70            40     271 41.8
## 217           5     109            62            41     129 35.8
## 218           6     125            68            30     120 30.0
## 219           5      85            74            22     122 29.0
## 220           5     112            66            32     129 37.8
## 221           0     177            60            29     478 34.6
## 222           2     158            90            30     165 31.6
## 223           7     119            68            28     112 25.2
## 224           7     142            60            33     190 28.8
## 225           1     100            66            15      56 23.6
## 226           1      87            78            27      32 34.6
## 227           0     101            76            32      92 35.7
## 228           3     162            52            38     194 37.2
## 229           4     197            70            39     744 36.7
## 230           0     117            80            31      53 45.2
## 231           4     142            86            38     160 44.0
## 232           6     134            80            37     370 46.2
## 233           1      79            80            25      37 25.4
## 234           4     122            68            33     130 35.0
## 235           3      74            68            28      45 29.7
## 236           4     171            72            32     225 43.6
## 237           7     181            84            21     192 35.9
## 238           0     179            90            27     185 44.1
## 239           9     164            84            21     165 30.8
## 240           0     104            76            20      70 18.4
## 241           1      91            64            24      87 29.2
## 242           4      91            70            32      88 33.1
## 243           3     139            54            35     160 25.6
## 244           6     119            50            22     176 27.1
## 245           2     146            76            35     194 38.2
## 246           9     184            85            15     156 30.0
## 247          10     122            68            33     122 31.2
## 248           0     165            90            33     680 52.3
## 249           9     124            70            33     402 35.4
## 250           1     111            86            19     116 30.1
## 251           9     106            52            25      83 31.2
## 252           2     129            84            22     110 28.0
## 253           2      90            80            14      55 24.4
## 254           0      86            68            32     100 35.8
## 255          12      92            62             7     258 27.6
## 256           1     113            64            35      96 33.6
## 257           3     111            56            39      94 30.1
## 258           2     114            68            22     105 28.7
## 259           1     193            50            16     375 25.9
## 260          11     155            76            28     150 33.3
## 261           3     191            68            15     130 30.9
## 262           3     141            78            30     190 30.0
## 263           4      95            70            32      88 32.1
## 264           3     142            80            15     155 32.4
## 265           4     123            62            29     165 32.0
## 266           5      96            74            18      67 33.6
## 267           0     138            70            38     167 36.3
## 268           2     128            64            42     158 40.0
## 269           0     102            52            20     100 25.1
## 270           2     146            70            30     135 27.5
## 271          10     101            86            37     155 45.6
## 272           2     108            62            32      56 25.2
## 273           3     122            78            36     112 23.0
## 274           1      71            78            50      45 33.2
## 275          13     106            70            33     180 34.2
## 276           2     100            70            52      57 40.5
## 277           7     106            60            24     129 26.5
## 278           0     104            64            23     116 27.8
## 279           5     114            74            28     135 24.9
## 280           2     108            62            10     278 25.3
## 281           0     146            70            40     167 37.9
## 282          10     129            76            28     122 35.9
## 283           7     133            88            15     155 32.4
## 284           7     161            86            32     165 30.4
## 285           2     108            80            27     140 27.0
## 286           7     136            74            26     135 26.0
## 287           5     155            84            44     545 38.7
## 288           1     119            86            39     220 45.6
## 289           4      96            56            17      49 20.8
## 290           5     108            72            43      75 36.1
## 291           0      78            88            29      40 36.9
## 292           0     107            62            30      74 36.6
## 293           2     128            78            37     182 43.3
## 294           1     128            48            45     194 40.5
## 295           0     161            50            20     168 21.9
## 296           6     151            62            31     120 35.5
## 297           2     146            70            38     360 28.0
## 298           0     126            84            29     215 30.7
## 299          14     100            78            25     184 36.6
## 300           8     112            72            19     135 23.6
## 301           0     167            74            30     245 32.3
## 302           2     144            58            33     135 31.6
## 303           5      77            82            41      42 35.8
## 304           5     115            98            36     220 52.9
## 305           3     150            76            30     140 21.0
## 306           2     120            76            37     105 39.7
## 307          10     161            68            23     132 25.5
## 308           0     137            68            14     148 24.8
## 309           0     128            68            19     180 30.5
## 310           2     124            68            28     205 32.9
## 311           6      80            66            30      71 26.2
## 312           0     106            70            37     148 39.4
## 313           2     155            74            17      96 26.6
## 314           3     113            50            10      85 29.5
## 315           7     109            80            31     156 35.9
## 316           2     112            68            22      94 34.1
## 317           3      99            80            11      64 19.3
## 318           3     182            74            31     135 30.5
## 319           3     115            66            39     140 38.1
## 320           6     194            78            32     300 23.5
## 321           4     129            60            12     231 27.5
## 322           3     112            74            30     135 31.6
## 323           0     124            70            20     115 27.4
## 324          13     152            90            33      29 26.8
## 325           2     112            75            32     135 35.7
## 326           1     157            72            21     168 25.6
## 327           1     122            64            32     156 35.1
## 328          10     179            70            33     122 35.1
## 329           2     102            86            36     120 45.5
## 330           6     105            70            32      68 30.8
## 331           8     118            72            19      87 23.1
## 332           2      87            58            16      52 32.7
## 333           1     180            74            31     180 43.3
## 334          12     106            80            31     100 23.6
## 335           1      95            60            18      58 23.9
## 336           0     165            76            43     255 47.9
## 337           0     117            90            26     196 33.8
## 338           5     115            76            31     156 31.2
## 339           9     152            78            34     171 34.2
## 340           7     178            84            32     225 39.9
## 341           1     130            70            13     105 25.9
## 342           1      95            74            21      73 25.9
## 343           1      93            68            35      77 32.0
## 344           5     122            86            37     105 34.7
## 345           8      95            72            26     105 36.8
## 346           8     126            88            36     108 38.5
## 347           1     139            46            19      83 28.7
## 348           3     116            64            22     105 23.5
## 349           3      99            62            19      74 21.8
## 350           5     116            80            32     175 41.0
## 351           4      92            80            36     105 42.2
## 352           4     137            84            37     130 31.2
## 353           3      61            82            28      76 34.4
## 354           1      90            62            12      43 27.2
## 355           3      90            78            32      88 42.7
## 356           9     165            88            31     165 30.4
## 357           1     125            50            40     167 33.3
## 358          13     129            76            30     150 39.9
## 359          12      88            74            40      54 35.3
## 360           1     196            76            36     249 36.5
## 361           5     189            64            33     325 31.2
## 362           5     158            70            27     168 29.8
## 363           5     103           108            37     108 39.2
## 364           4     146            78            35     300 38.5
## 365           4     147            74            25     293 34.9
## 366           5      99            54            28      83 34.0
## 367           6     124            72            29     130 27.6
## 368           0     101            64            17      82 21.0
## 369           3      81            86            16      66 27.5
## 370           1     133           102            28     140 32.8
## 371           3     173            82            48     465 38.4
## 372           0     118            64            23      89 27.7
## 373           0      84            64            22      66 35.8
## 374           2     105            58            40      94 34.9
## 375           2     122            52            43     158 36.2
## 376          12     140            82            43     325 39.2
## 377           0      98            82            15      84 25.2
## 378           1      87            60            37      75 37.2
## 379           4     156            75            36     277 48.3
## 380           0      93           100            39      72 43.4
## 381           1     107            72            30      82 30.8
## 382           0     105            68            22      58 20.0
## 383           1     109            60             8     182 25.4
## 384           1      90            62            18      59 25.1
## 385           1     125            70            24     110 24.3
## 386           1     119            54            13      50 22.3
## 387           5     116            74            29     156 32.3
## 388           8     105           100            36     215 43.3
## 389           5     144            82            26     285 32.0
## 390           3     100            68            23      81 31.6
## 391           1     100            66            29     196 32.0
## 392           5     166            76            36     210 45.7
## 393           1     131            64            14     415 23.7
## 394           4     116            72            12      87 22.1
## 395           4     158            78            32     205 32.9
## 396           2     127            58            24     275 27.7
## 397           3      96            56            34     115 24.7
## 398           0     131            66            40     165 34.3
## 399           3      82            70            22      44 21.1
## 400           3     193            70            31     225 34.9
## 401           4      95            64            30     115 32.0
## 402           6     137            61            27     190 24.2
## 403           5     136            84            41      88 35.0
## 404           9      72            78            25      68 31.6
## 405           5     168            64            35     225 32.9
## 406           2     123            48            32     165 42.1
## 407           4     115            72            29     122 28.9
## 408           0     101            62            22      58 21.9
## 409           8     197            74            28     225 25.9
## 410           1     172            68            49     579 42.4
## 411           6     102            90            39      77 35.7
## 412           1     112            72            30     176 34.4
## 413           1     143            84            23     310 42.4
## 414           1     143            74            22      61 26.2
## 415           0     138            60            35     167 34.6
## 416           3     173            84            33     474 35.7
## 417           1      97            68            21      81 27.2
## 418           4     144            82            32     210 38.5
## 419           1      83            68            23      49 18.2
## 420           3     129            64            29     115 26.4
## 421           1     119            88            41     170 45.3
## 422           2      94            68            18      76 26.0
## 423           0     102            64            46      78 40.6
## 424           2     115            64            22     106 30.8
## 425           8     151            78            32     210 42.9
## 426           4     184            78            39     277 37.0
## 427           0      94            76            31     115 35.8
## 428           1     181            64            30     180 34.1
## 429           0     135            94            46     145 40.6
## 430           1      95            82            25     180 35.0
## 431           2      99            60            17      74 22.2
## 432           3      89            74            16      85 30.4
## 433           1      80            74            11      60 30.0
## 434           2     139            75            22     110 25.6
## 435           1      90            68             8      70 24.5
## 436           0     141            75            40     230 42.4
## 437          12     140            85            33     108 37.4
## 438           5     147            75            27     126 29.9
## 439           1      97            70            15      46 18.2
## 440           6     107            88            33     230 36.8
## 441           0     189           104            25     145 34.3
## 442           2      83            66            23      50 32.2
## 443           4     117            64            27     120 33.2
## 444           8     108            70            31     156 30.5
## 445           4     117            62            12     115 29.7
## 446           0     180            78            63      14 59.4
## 447           1     100            72            12      70 25.3
## 448           0      95            80            45      92 36.5
## 449           0     104            64            37      64 33.6
## 450           0     120            74            18      63 30.5
## 451           1      82            64            13      95 21.2
## 452           2     134            70            30     190 28.9
## 453           0      91            68            32     210 39.9
## 454           2     119            74            26      73 19.6
## 455           2     100            54            28     105 37.8
## 456          14     175            62            30     132 33.6
## 457           1     135            54            26     152 26.7
## 458           5      86            68            28      71 30.2
## 459          10     148            84            48     237 37.6
## 460           9     134            74            33      60 25.9
## 461           9     120            72            22      56 20.8
## 462           1      71            62            18      41 21.8
## 463           8      74            70            40      49 35.3
## 464           5      88            78            30      68 27.6
## 465          10     115            98            28     110 24.0
## 466           0     124            56            13     105 21.8
## 467           0      74            52            10      36 27.8
## 468           0      97            64            36     100 36.8
## 469           8     120            74            32     130 30.0
## 470           6     154            78            41     140 46.1
## 471           1     144            82            40     194 41.3
## 472           0     137            70            38     135 33.2
## 473           0     119            66            27     142 38.8
## 474           7     136            90            31     135 29.9
## 475           4     114            64            22     120 28.9
## 476           0     137            84            27     120 27.3
## 477           2     105            80            45     191 33.7
## 478           7     114            76            17     110 23.8
## 479           8     126            74            38      75 25.9
## 480           4     132            86            31     135 28.0
## 481           3     158            70            30     328 35.5
## 482           0     123            88            37     105 35.2
## 483           4      85            58            22      49 27.8
## 484           0      84            82            31     125 38.2
## 485           0     145            80            36     220 44.2
## 486           0     135            68            42     250 42.3
## 487           1     139            62            41     480 40.7
## 488           0     173            78            32     265 46.5
## 489           4      99            72            17      51 25.6
## 490           8     194            80            31     135 26.1
## 491           2      83            65            28      66 36.8
## 492           2      89            90            30     100 33.5
## 493           4      99            68            38      88 32.8
## 494           4     125            70            18     122 28.9
## 495           3      80            78            32      56 32.0
## 496           6     166            74            27     168 26.6
## 497           5     110            68            27     100 26.0
## 498           2      81            72            15      76 30.1
## 499           7     195            70            33     145 25.1
## 500           6     154            74            32     193 29.3
## 501           2     117            90            19      71 25.2
## 502           3      84            72            32      77 37.2
## 503           6     144            68            41     215 39.0
## 504           7      94            64            25      79 33.3
## 505           3      96            78            39     105 37.3
## 506          10      75            82            30      49 33.3
## 507           0     180            90            26      90 36.5
## 508           1     130            60            23     170 28.6
## 509           2      84            50            23      76 30.4
## 510           8     120            78            26      60 25.0
## 511          12      84            72            31     175 29.7
## 512           0     139            62            17     210 22.1
## 513           9      91            68            18     126 24.2
## 514           2      91            62            23      50 27.3
## 515           3      99            54            19      86 25.6
## 516           3     163            70            18     105 31.6
## 517           9     145            88            34     165 30.3
## 518           7     125            86            30     108 37.6
## 519          13      76            60            37     105 32.8
## 520           6     129            90             7     326 19.6
## 521           2      68            70            32      66 25.0
## 522           3     124            80            33     130 33.2
## 523           6     114            92            36     170 34.7
## 524           9     130            70            35     144 34.2
## 525           3     125            58            24     158 31.6
## 526           3      87            60            18      58 21.8
## 527           1      97            64            19      82 18.2
## 528           3     116            74            15     105 26.3
## 529           0     117            66            31     188 30.8
## 530           0     111            65            22      73 24.6
## 531           2     122            60            18     106 29.8
## 532           0     107            76            32     148 45.3
## 533           1      86            66            52      65 41.3
## 534           6      91            78            28      71 29.8
## 535           1      77            56            30      56 33.3
## 536           4     132            62            35     135 32.9
## 537           0     105            90            27     105 29.6
## 538           0      57            60            20      56 21.7
## 539           0     127            80            37     210 36.3
## 540           3     129            92            49     155 36.4
## 541           8     100            74            40     215 39.4
## 542           3     128            72            25     190 32.4
## 543          10      90            85            32     165 34.9
## 544           4      84            90            23      56 39.5
## 545           1      88            78            29      76 32.0
## 546           8     186            90            35     225 34.5
## 547           5     187            76            27     207 43.6
## 548           4     131            68            21     166 33.1
## 549           1     164            82            43      67 32.8
## 550           4     189           110            31     130 28.5
## 551           1     116            70            28     110 27.4
## 552           3      84            68            30     106 31.9
## 553           6     114            88            18     155 27.8
## 554           1      88            62            24      44 29.9
## 555           1      84            64            23     115 36.9
## 556           7     124            70            33     215 25.5
## 557           1      97            70            40      90 38.1
## 558           8     110            76            18     135 27.8
## 559          11     103            68            40      94 46.2
## 560          11      85            74            27     105 30.1
## 561           6     125            76            32     370 33.8
## 562           0     198            66            32     274 41.3
## 563           1      87            68            34      77 37.6
## 564           6      99            60            19      54 26.9
## 565           0      91            80            31     100 32.4
## 566           2      95            54            14      88 26.1
## 567           1      99            72            30      18 38.6
## 568           6      92            62            32     126 32.0
## 569           4     154            72            29     126 31.3
## 570           0     121            66            30     165 34.3
## 571           3      78            70            32      55 32.5
## 572           2     130            96            22     110 22.6
## 573           3     111            58            31      44 29.5
## 574           2      98            60            17     120 34.7
## 575           1     143            86            30     330 30.1
## 576           1     119            44            47      63 35.5
## 577           6     108            44            20     130 24.0
## 578           2     118            80            35     182 42.9
## 579          10     133            68            31     122 27.0
## 580           2     197            70            99     495 34.7
## 581           0     151            90            46     230 42.1
## 582           6     109            60            27      64 25.0
## 583          12     121            78            17     110 26.5
## 584           8     100            76            39     105 38.7
## 585           8     124            76            24     600 28.7
## 586           1      93            56            11      58 22.5
## 587           8     143            66            36     304 34.9
## 588           6     103            66            27      68 24.3
## 589           3     176            86            27     156 33.3
## 590           0      73            62            17      41 21.1
## 591          11     111            84            40     215 46.8
## 592           2     112            78            50     140 39.4
## 593           3     132            80            32     140 34.4
## 594           2      82            52            22     115 28.5
## 595           6     123            72            45     230 33.6
## 596           0     188            82            14     185 32.0
## 597           0      67            76            32     125 45.3
## 598           1      89            24            19      25 27.8
## 599           1     173            74            31     180 36.8
## 600           1     109            38            18     120 23.1
## 601           1     108            88            19      84 27.1
## 602           6      96            74            27     100 23.7
## 603           1     124            74            36     110 27.8
## 604           7     150            78            29     126 35.2
## 605           4     183            66            28     180 28.4
## 606           1     124            60            32     176 35.8
## 607           1     181            78            42     293 40.0
## 608           1      92            62            25      41 19.5
## 609           0     152            82            39     272 41.5
## 610           1     111            62            13     182 24.0
## 611           3     106            54            21     158 30.9
## 612           3     174            58            22     194 32.9
## 613           7     168            88            42     321 38.2
## 614           6     105            80            28      82 32.5
## 615          11     138            74            26     144 36.1
## 616           3     106            72            22     100 25.8
## 617           6     117            96            38     100 28.7
## 618           2      68            62            13      15 20.1
## 619           9     112            82            24     155 28.2
## 620           0     119            70            30      74 32.4
## 621           2     112            86            42     160 38.4
## 622           2      92            76            20      81 24.2
## 623           6     183            94            31     193 40.8
## 624           0      94            70            27     115 43.5
## 625           2     108            64            23      94 30.8
## 626           4      90            88            47      54 37.7
## 627           0     125            68            22     148 24.7
## 628           0     132            78            26     188 32.4
## 629           5     128            80            39     105 34.6
## 630           4      94            65            22      74 24.7
## 631           7     114            64            29     156 27.4
## 632           0     102            78            40      90 34.5
## 633           2     111            60            23     116 26.2
## 634           1     128            82            17     183 27.5
## 635          10      92            62            27      54 25.9
## 636          13     104            72            31     130 31.2
## 637           5     104            74            30     105 28.8
## 638           2      94            76            18      66 31.6
## 639           7      97            76            32      91 40.9
## 640           1     100            74            12      46 19.5
## 641           0     102            86            17     105 29.3
## 642           4     128            70            32     130 34.3
## 643           6     147            80            31     285 29.5
## 644           4      90            66            23      54 28.0
## 645           3     103            72            30     152 27.6
## 646           2     157            74            35     440 39.4
## 647           1     167            74            17     144 23.4
## 648           0     179            50            36     159 37.8
## 649          11     136            84            35     130 28.3
## 650           0     107            60            25     116 26.4
## 651           1      91            54            25     100 25.2
## 652           1     117            60            23     106 33.8
## 653           5     123            74            40      77 34.1
## 654           2     120            54            22     106 26.8
## 655           1     106            70            28     135 34.2
## 656           2     155            52            27     540 38.7
## 657           2     101            58            35      90 21.8
## 658           1     120            80            48     200 38.9
## 659          11     127           106            33     105 39.0
## 660           3      80            82            31      70 34.2
## 661          10     162            84            31     110 27.7
## 662           1     199            76            43     274 42.9
## 663           8     167           106            46     231 37.6
## 664           9     145            80            46     130 37.9
## 665           6     115            60            39     125 33.7
## 666           1     112            80            45     132 34.8
## 667           4     145            82            18     175 32.5
## 668          10     111            70            27     130 27.5
## 669           6      98            58            33     190 34.0
## 670           9     154            78            30     100 30.9
## 671           6     165            68            26     168 33.6
## 672           1      99            58            10      94 25.4
## 673          10      68           106            23      49 35.5
## 674           3     123           100            35     240 57.3
## 675           8      91            82            26     108 35.6
## 676           6     195            70            28     200 30.9
## 677           9     156            86            32     145 24.8
## 678           0      93            60            32      87 35.3
## 679           3     121            52            35     129 36.0
## 680           2     101            58            17     265 24.2
## 681           2      56            56            28      45 24.2
## 682           0     162            76            36     130 49.6
## 683           0      95            64            39     105 44.6
## 684           4     125            80            30     160 32.3
## 685           5     136            82            26     135 28.0
## 686           2     129            74            26     205 33.2
## 687           3     130            64            22     210 23.1
## 688           1     107            50            19     100 28.3
## 689           1     140            74            26     180 24.1
## 690           1     144            82            46     180 46.1
## 691           8     107            80            28     110 24.6
## 692          13     158           114            32     146 42.3
## 693           2     121            70            32      95 39.1
## 694           7     129            68            49     125 38.5
## 695           2      90            60            22      74 23.5
## 696           7     142            90            24     480 30.4
## 697           3     169            74            19     125 29.9
## 698           0      99            62            22      94 25.0
## 699           4     127            88            11     155 34.5
## 700           4     118            70            32     135 44.5
## 701           2     122            76            27     200 35.9
## 702           6     125            78            31     175 27.6
## 703           1     168            88            29     156 35.0
## 704           2     129            86            37     105 38.5
## 705           4     110            76            20     100 28.4
## 706           6      80            80            36      54 39.8
## 707          10     115            96            36     175 34.2
## 708           2     127            46            21     335 34.4
## 709           9     164            78            32     132 32.8
## 710           2      93            64            32     160 38.0
## 711           3     158            64            13     387 31.2
## 712           5     126            78            27      22 29.6
## 713          10     129            62            36     130 41.2
## 714           0     134            58            20     291 26.4
## 715           3     102            74            27     105 29.5
## 716           7     187            50            33     392 33.9
## 717           3     173            78            39     185 33.8
## 718          10      94            72            18     110 23.1
## 719           1     108            60            46     178 35.5
## 720           5      97            76            27     180 35.6
## 721           4      83            86            19      66 29.3
## 722           1     114            66            36     200 38.1
## 723           1     149            68            29     127 29.3
## 724           5     117            86            30     105 39.1
## 725           1     111            94            30     160 32.8
## 726           4     112            78            40     105 39.4
## 727           1     116            78            29     180 36.1
## 728           0     141            84            26     205 32.4
## 729           2     175            88            25      71 22.9
## 730           2      92            52            23      86 30.1
## 731           3     130            78            23      79 28.4
## 732           8     120            86            27     115 28.4
## 733           2     174            88            37     120 44.5
## 734           2     106            56            27     165 29.0
## 735           2     105            75            20      87 23.3
## 736           4      95            60            32      83 35.4
## 737           0     126            86            27     120 27.4
## 738           8      65            72            23      71 32.0
## 739           2      99            60            17     160 36.6
## 740           1     102            74            32     145 39.5
## 741          11     120            80            37     150 42.3
## 742           3     102            44            20      94 30.8
## 743           1     109            58            18     116 28.5
## 744           9     140            94            35     146 32.7
## 745          13     153            88            37     140 40.6
## 746          12     100            84            33     105 30.0
## 747           1     147            94            41     220 49.3
## 748           1      81            74            41      57 46.3
## 749           3     187            70            22     200 36.4
## 750           6     162            62            35     175 24.3
## 751           4     136            70            29     190 31.2
## 752           1     121            78            39      74 39.0
## 753           3     108            62            24      86 26.0
## 754           0     181            88            44     510 43.3
## 755           8     154            78            32     210 32.4
## 756           1     128            88            39     110 36.5
## 757           7     137            90            41      94 32.0
## 758           0     123            72            35     145 36.3
## 759           1     106            76            32      90 37.5
## 760           6     190            92            33     225 35.5
## 761           2      88            58            26      16 28.4
## 762           9     170            74            31     225 44.0
## 763           9      89            62            27      54 22.5
## 764          10     101            76            48     180 32.9
## 765           2     122            70            27     180 36.8
## 766           5     121            72            23     112 26.2
## 767           1     126            60            31     140 30.1
## 768           1      93            70            31      44 30.4
##     DiabetesPedigreeFunction Age Outcome
## 1                      0.627  50       1
## 2                      0.351  31       0
## 3                      0.672  32       1
## 4                      0.167  21       0
## 5                      2.288  33       1
## 6                      0.201  30       0
## 7                      0.248  26       1
## 8                      0.134  29       0
## 9                      0.158  53       1
## 10                     0.232  54       1
## 11                     0.191  30       0
## 12                     0.537  34       1
## 13                     1.441  57       0
## 14                     0.398  59       1
## 15                     0.587  51       1
## 16                     0.484  32       1
## 17                     0.551  31       1
## 18                     0.254  31       1
## 19                     0.183  33       0
## 20                     0.529  32       1
## 21                     0.704  27       0
## 22                     0.388  50       0
## 23                     0.451  41       1
## 24                     0.263  29       1
## 25                     0.254  51       1
## 26                     0.205  41       1
## 27                     0.257  43       1
## 28                     0.487  22       0
## 29                     0.245  57       0
## 30                     0.337  38       0
## 31                     0.546  60       0
## 32                     0.851  28       1
## 33                     0.267  22       0
## 34                     0.188  28       0
## 35                     0.512  45       0
## 36                     0.966  33       0
## 37                     0.420  35       0
## 38                     0.665  46       1
## 39                     0.503  27       1
## 40                     1.390  56       1
## 41                     0.271  26       0
## 42                     0.696  37       0
## 43                     0.235  48       0
## 44                     0.721  54       1
## 45                     0.294  40       0
## 46                     1.893  25       1
## 47                     0.564  29       0
## 48                     0.586  22       0
## 49                     0.344  31       1
## 50                     0.305  24       0
## 51                     0.491  22       0
## 52                     0.526  26       0
## 53                     0.342  30       0
## 54                     0.467  58       1
## 55                     0.718  42       0
## 56                     0.248  21       0
## 57                     0.254  41       1
## 58                     0.962  31       0
## 59                     1.781  44       0
## 60                     0.173  22       0
## 61                     0.304  21       0
## 62                     0.270  39       1
## 63                     0.587  36       0
## 64                     0.699  24       0
## 65                     0.258  42       1
## 66                     0.203  32       0
## 67                     0.855  38       1
## 68                     0.845  54       0
## 69                     0.334  25       0
## 70                     0.189  27       0
## 71                     0.867  28       1
## 72                     0.411  26       0
## 73                     0.583  42       1
## 74                     0.231  23       0
## 75                     0.396  22       0
## 76                     0.140  22       0
## 77                     0.391  41       0
## 78                     0.370  27       0
## 79                     0.270  26       1
## 80                     0.307  24       0
## 81                     0.140  22       0
## 82                     0.102  22       0
## 83                     0.767  36       0
## 84                     0.237  22       0
## 85                     0.227  37       1
## 86                     0.698  27       0
## 87                     0.178  45       0
## 88                     0.324  26       0
## 89                     0.153  43       1
## 90                     0.165  24       0
## 91                     0.258  21       0
## 92                     0.443  34       0
## 93                     0.261  42       0
## 94                     0.277  60       1
## 95                     0.761  21       0
## 96                     0.255  40       0
## 97                     0.130  24       0
## 98                     0.323  22       0
## 99                     0.356  23       0
## 100                    0.325  31       1
## 101                    1.222  33       1
## 102                    0.179  22       0
## 103                    0.262  21       0
## 104                    0.283  24       0
## 105                    0.930  27       0
## 106                    0.801  21       0
## 107                    0.207  27       0
## 108                    0.287  37       0
## 109                    0.336  25       0
## 110                    0.247  24       1
## 111                    0.199  24       1
## 112                    0.543  46       1
## 113                    0.192  23       0
## 114                    0.391  25       0
## 115                    0.588  39       1
## 116                    0.539  61       1
## 117                    0.220  38       1
## 118                    0.654  25       0
## 119                    0.443  22       0
## 120                    0.223  21       0
## 121                    0.759  25       1
## 122                    0.260  24       0
## 123                    0.404  23       0
## 124                    0.186  69       0
## 125                    0.278  23       1
## 126                    0.496  26       1
## 127                    0.452  30       0
## 128                    0.261  23       0
## 129                    0.403  40       1
## 130                    0.741  62       1
## 131                    0.361  33       1
## 132                    1.114  33       1
## 133                    0.356  30       1
## 134                    0.457  39       0
## 135                    0.647  26       0
## 136                    0.088  31       0
## 137                    0.597  21       0
## 138                    0.532  22       0
## 139                    0.703  29       0
## 140                    0.159  28       0
## 141                    0.268  55       0
## 142                    0.286  38       0
## 143                    0.318  22       0
## 144                    0.272  42       1
## 145                    0.237  23       0
## 146                    0.572  21       0
## 147                    0.096  41       0
## 148                    1.400  34       0
## 149                    0.218  65       0
## 150                    0.085  22       0
## 151                    0.399  24       0
## 152                    0.432  37       0
## 153                    1.189  42       1
## 154                    0.687  23       0
## 155                    0.137  43       1
## 156                    0.337  36       1
## 157                    0.637  21       0
## 158                    0.833  23       0
## 159                    0.229  22       0
## 160                    0.817  47       1
## 161                    0.294  36       0
## 162                    0.204  45       0
## 163                    0.167  27       0
## 164                    0.368  21       0
## 165                    0.743  32       1
## 166                    0.722  41       1
## 167                    0.256  22       0
## 168                    0.709  34       0
## 169                    0.471  29       0
## 170                    0.495  29       0
## 171                    0.180  36       1
## 172                    0.542  29       1
## 173                    0.773  25       0
## 174                    0.678  23       0
## 175                    0.370  33       0
## 176                    0.719  36       1
## 177                    0.382  42       0
## 178                    0.319  26       1
## 179                    0.190  47       0
## 180                    0.956  37       1
## 181                    0.084  32       0
## 182                    0.725  23       0
## 183                    0.299  21       0
## 184                    0.268  27       0
## 185                    0.244  40       0
## 186                    0.745  41       1
## 187                    0.615  60       1
## 188                    1.321  33       1
## 189                    0.640  31       1
## 190                    0.361  25       1
## 191                    0.142  21       0
## 192                    0.374  40       0
## 193                    0.383  36       1
## 194                    0.578  40       1
## 195                    0.136  42       0
## 196                    0.395  29       1
## 197                    0.187  21       0
## 198                    0.678  23       1
## 199                    0.905  26       1
## 200                    0.150  29       1
## 201                    0.874  21       0
## 202                    0.236  28       0
## 203                    0.787  32       0
## 204                    0.235  27       0
## 205                    0.324  55       0
## 206                    0.407  27       0
## 207                    0.605  57       1
## 208                    0.151  52       1
## 209                    0.289  21       0
## 210                    0.355  41       1
## 211                    0.290  25       0
## 212                    0.375  24       0
## 213                    0.164  60       0
## 214                    0.431  24       1
## 215                    0.260  36       1
## 216                    0.742  38       1
## 217                    0.514  25       1
## 218                    0.464  32       0
## 219                    1.224  32       1
## 220                    0.261  41       1
## 221                    1.072  21       1
## 222                    0.805  66       1
## 223                    0.209  37       0
## 224                    0.687  61       0
## 225                    0.666  26       0
## 226                    0.101  22       0
## 227                    0.198  26       0
## 228                    0.652  24       1
## 229                    2.329  31       0
## 230                    0.089  24       0
## 231                    0.645  22       1
## 232                    0.238  46       1
## 233                    0.583  22       0
## 234                    0.394  29       0
## 235                    0.293  23       0
## 236                    0.479  26       1
## 237                    0.586  51       1
## 238                    0.686  23       1
## 239                    0.831  32       1
## 240                    0.582  27       0
## 241                    0.192  21       0
## 242                    0.446  22       0
## 243                    0.402  22       1
## 244                    1.318  33       1
## 245                    0.329  29       0
## 246                    1.213  49       1
## 247                    0.258  41       0
## 248                    0.427  23       0
## 249                    0.282  34       0
## 250                    0.143  23       0
## 251                    0.380  42       0
## 252                    0.284  27       0
## 253                    0.249  24       0
## 254                    0.238  25       0
## 255                    0.926  44       1
## 256                    0.543  21       1
## 257                    0.557  30       0
## 258                    0.092  25       0
## 259                    0.655  24       0
## 260                    1.353  51       1
## 261                    0.299  34       0
## 262                    0.761  27       1
## 263                    0.612  24       0
## 264                    0.200  63       0
## 265                    0.226  35       1
## 266                    0.997  43       0
## 267                    0.933  25       1
## 268                    1.101  24       0
## 269                    0.078  21       0
## 270                    0.240  28       1
## 271                    1.136  38       1
## 272                    0.128  21       0
## 273                    0.254  40       0
## 274                    0.422  21       0
## 275                    0.251  52       0
## 276                    0.677  25       0
## 277                    0.296  29       1
## 278                    0.454  23       0
## 279                    0.744  57       0
## 280                    0.881  22       0
## 281                    0.334  28       1
## 282                    0.280  39       0
## 283                    0.262  37       0
## 284                    0.165  47       1
## 285                    0.259  52       1
## 286                    0.647  51       0
## 287                    0.619  34       0
## 288                    0.808  29       1
## 289                    0.340  26       0
## 290                    0.263  33       0
## 291                    0.434  21       0
## 292                    0.757  25       1
## 293                    1.224  31       1
## 294                    0.613  24       1
## 295                    0.254  65       0
## 296                    0.692  28       0
## 297                    0.337  29       1
## 298                    0.520  24       0
## 299                    0.412  46       1
## 300                    0.840  58       0
## 301                    0.839  30       1
## 302                    0.422  25       1
## 303                    0.156  35       0
## 304                    0.209  28       1
## 305                    0.207  37       0
## 306                    0.215  29       0
## 307                    0.326  47       1
## 308                    0.143  21       0
## 309                    1.391  25       1
## 310                    0.875  30       1
## 311                    0.313  41       0
## 312                    0.605  22       0
## 313                    0.433  27       1
## 314                    0.626  25       0
## 315                    1.127  43       1
## 316                    0.315  26       0
## 317                    0.284  30       0
## 318                    0.345  29       1
## 319                    0.150  28       0
## 320                    0.129  59       1
## 321                    0.527  31       0
## 322                    0.197  25       1
## 323                    0.254  36       1
## 324                    0.731  43       1
## 325                    0.148  21       0
## 326                    0.123  24       0
## 327                    0.692  30       1
## 328                    0.200  37       0
## 329                    0.127  23       1
## 330                    0.122  37       0
## 331                    1.476  46       0
## 332                    0.166  25       0
## 333                    0.282  41       1
## 334                    0.137  44       0
## 335                    0.260  22       0
## 336                    0.259  26       0
## 337                    0.932  44       0
## 338                    0.343  44       1
## 339                    0.893  33       1
## 340                    0.331  41       1
## 341                    0.472  22       0
## 342                    0.673  36       0
## 343                    0.389  22       0
## 344                    0.290  33       0
## 345                    0.485  57       0
## 346                    0.349  49       0
## 347                    0.654  22       0
## 348                    0.187  23       0
## 349                    0.279  26       0
## 350                    0.346  37       1
## 351                    0.237  29       0
## 352                    0.252  30       0
## 353                    0.243  46       0
## 354                    0.580  24       0
## 355                    0.559  21       0
## 356                    0.302  49       1
## 357                    0.962  28       1
## 358                    0.569  44       1
## 359                    0.378  48       0
## 360                    0.875  29       1
## 361                    0.583  29       1
## 362                    0.207  63       0
## 363                    0.305  65       0
## 364                    0.520  67       1
## 365                    0.385  30       0
## 366                    0.499  30       0
## 367                    0.368  29       1
## 368                    0.252  21       0
## 369                    0.306  22       0
## 370                    0.234  45       1
## 371                    2.137  25       1
## 372                    1.731  21       0
## 373                    0.545  21       0
## 374                    0.225  25       0
## 375                    0.816  28       0
## 376                    0.528  58       1
## 377                    0.299  22       0
## 378                    0.509  22       0
## 379                    0.238  32       1
## 380                    1.021  35       0
## 381                    0.821  24       0
## 382                    0.236  22       0
## 383                    0.947  21       0
## 384                    1.268  25       0
## 385                    0.221  25       0
## 386                    0.205  24       0
## 387                    0.660  35       1
## 388                    0.239  45       1
## 389                    0.452  58       1
## 390                    0.949  28       0
## 391                    0.444  42       0
## 392                    0.340  27       1
## 393                    0.389  21       0
## 394                    0.463  37       0
## 395                    0.803  31       1
## 396                    1.600  25       0
## 397                    0.944  39       0
## 398                    0.196  22       1
## 399                    0.389  25       0
## 400                    0.241  25       1
## 401                    0.161  31       1
## 402                    0.151  55       0
## 403                    0.286  35       1
## 404                    0.280  38       0
## 405                    0.135  41       1
## 406                    0.520  26       0
## 407                    0.376  46       1
## 408                    0.336  25       0
## 409                    1.191  39       1
## 410                    0.702  28       1
## 411                    0.674  28       0
## 412                    0.528  25       0
## 413                    1.076  22       0
## 414                    0.256  21       0
## 415                    0.534  21       1
## 416                    0.258  22       1
## 417                    1.095  22       0
## 418                    0.554  37       1
## 419                    0.624  27       0
## 420                    0.219  28       1
## 421                    0.507  26       0
## 422                    0.561  21       0
## 423                    0.496  21       0
## 424                    0.421  21       0
## 425                    0.516  36       1
## 426                    0.264  31       1
## 427                    0.256  25       0
## 428                    0.328  38       1
## 429                    0.284  26       0
## 430                    0.233  43       1
## 431                    0.108  23       0
## 432                    0.551  38       0
## 433                    0.527  22       0
## 434                    0.167  29       0
## 435                    1.138  36       0
## 436                    0.205  29       1
## 437                    0.244  41       0
## 438                    0.434  28       0
## 439                    0.147  21       0
## 440                    0.727  31       0
## 441                    0.435  41       1
## 442                    0.497  22       0
## 443                    0.230  24       0
## 444                    0.955  33       1
## 445                    0.380  30       1
## 446                    2.420  25       1
## 447                    0.658  28       0
## 448                    0.330  26       0
## 449                    0.510  22       1
## 450                    0.285  26       0
## 451                    0.415  23       0
## 452                    0.542  23       1
## 453                    0.381  25       0
## 454                    0.832  72       0
## 455                    0.498  24       0
## 456                    0.212  38       1
## 457                    0.687  62       0
## 458                    0.364  24       0
## 459                    1.001  51       1
## 460                    0.460  81       0
## 461                    0.733  48       0
## 462                    0.416  26       0
## 463                    0.705  39       0
## 464                    0.258  37       0
## 465                    1.022  34       0
## 466                    0.452  21       0
## 467                    0.269  22       0
## 468                    0.600  25       0
## 469                    0.183  38       1
## 470                    0.571  27       0
## 471                    0.607  28       0
## 472                    0.170  22       0
## 473                    0.259  22       0
## 474                    0.210  50       0
## 475                    0.126  24       0
## 476                    0.231  59       0
## 477                    0.711  29       1
## 478                    0.466  31       0
## 479                    0.162  39       0
## 480                    0.419  63       0
## 481                    0.344  35       1
## 482                    0.197  29       0
## 483                    0.306  28       0
## 484                    0.233  23       0
## 485                    0.630  31       1
## 486                    0.365  24       1
## 487                    0.536  21       0
## 488                    1.159  58       0
## 489                    0.294  28       0
## 490                    0.551  67       0
## 491                    0.629  24       0
## 492                    0.292  42       0
## 493                    0.145  33       0
## 494                    1.144  45       1
## 495                    0.174  22       0
## 496                    0.304  66       0
## 497                    0.292  30       0
## 498                    0.547  25       0
## 499                    0.163  55       1
## 500                    0.839  39       0
## 501                    0.313  21       0
## 502                    0.267  28       0
## 503                    0.727  41       1
## 504                    0.738  41       0
## 505                    0.238  40       0
## 506                    0.263  38       0
## 507                    0.314  35       1
## 508                    0.692  21       0
## 509                    0.968  21       0
## 510                    0.409  64       0
## 511                    0.297  46       1
## 512                    0.207  21       0
## 513                    0.200  58       0
## 514                    0.525  22       0
## 515                    0.154  24       0
## 516                    0.268  28       1
## 517                    0.771  53       1
## 518                    0.304  51       0
## 519                    0.180  41       0
## 520                    0.582  60       0
## 521                    0.187  25       0
## 522                    0.305  26       0
## 523                    0.189  26       0
## 524                    0.652  45       1
## 525                    0.151  24       0
## 526                    0.444  21       0
## 527                    0.299  21       0
## 528                    0.107  24       0
## 529                    0.493  22       0
## 530                    0.660  31       0
## 531                    0.717  22       0
## 532                    0.686  24       0
## 533                    0.917  29       0
## 534                    0.501  31       0
## 535                    1.251  24       0
## 536                    0.302  23       1
## 537                    0.197  46       0
## 538                    0.735  67       0
## 539                    0.804  23       0
## 540                    0.968  32       1
## 541                    0.661  43       1
## 542                    0.549  27       1
## 543                    0.825  56       1
## 544                    0.159  25       0
## 545                    0.365  29       0
## 546                    0.423  37       1
## 547                    1.034  53       1
## 548                    0.160  28       0
## 549                    0.341  50       0
## 550                    0.680  37       0
## 551                    0.204  21       0
## 552                    0.591  25       0
## 553                    0.247  66       0
## 554                    0.422  23       0
## 555                    0.471  28       0
## 556                    0.161  37       0
## 557                    0.218  30       0
## 558                    0.237  58       0
## 559                    0.126  42       0
## 560                    0.300  35       0
## 561                    0.121  54       1
## 562                    0.502  28       1
## 563                    0.401  24       0
## 564                    0.497  32       0
## 565                    0.601  27       0
## 566                    0.748  22       0
## 567                    0.412  21       0
## 568                    0.085  46       0
## 569                    0.338  37       0
## 570                    0.203  33       1
## 571                    0.270  39       0
## 572                    0.268  21       0
## 573                    0.430  22       0
## 574                    0.198  22       0
## 575                    0.892  23       0
## 576                    0.280  25       0
## 577                    0.813  35       0
## 578                    0.693  21       1
## 579                    0.245  36       0
## 580                    0.575  62       1
## 581                    0.371  21       1
## 582                    0.206  27       0
## 583                    0.259  62       0
## 584                    0.190  42       0
## 585                    0.687  52       1
## 586                    0.417  22       0
## 587                    0.129  41       1
## 588                    0.249  29       0
## 589                    1.154  52       1
## 590                    0.342  25       0
## 591                    0.925  45       1
## 592                    0.175  24       0
## 593                    0.402  44       1
## 594                    1.699  25       0
## 595                    0.733  34       0
## 596                    0.682  22       1
## 597                    0.194  46       0
## 598                    0.559  21       0
## 599                    0.088  38       1
## 600                    0.407  26       0
## 601                    0.400  24       0
## 602                    0.190  28       0
## 603                    0.100  30       0
## 604                    0.692  54       1
## 605                    0.212  36       1
## 606                    0.514  21       0
## 607                    1.258  22       1
## 608                    0.482  25       0
## 609                    0.270  27       0
## 610                    0.138  23       0
## 611                    0.292  24       0
## 612                    0.593  36       1
## 613                    0.787  40       1
## 614                    0.878  26       0
## 615                    0.557  50       1
## 616                    0.207  27       0
## 617                    0.157  30       0
## 618                    0.257  23       0
## 619                    1.282  50       1
## 620                    0.141  24       1
## 621                    0.246  28       0
## 622                    1.698  28       0
## 623                    1.461  45       0
## 624                    0.347  21       0
## 625                    0.158  21       0
## 626                    0.362  29       0
## 627                    0.206  21       0
## 628                    0.393  21       0
## 629                    0.144  45       0
## 630                    0.148  21       0
## 631                    0.732  34       1
## 632                    0.238  24       0
## 633                    0.343  23       0
## 634                    0.115  22       0
## 635                    0.167  31       0
## 636                    0.465  38       1
## 637                    0.153  48       0
## 638                    0.649  23       0
## 639                    0.871  32       1
## 640                    0.149  28       0
## 641                    0.695  27       0
## 642                    0.303  24       0
## 643                    0.178  50       1
## 644                    0.610  31       0
## 645                    0.730  27       0
## 646                    0.134  30       0
## 647                    0.447  33       1
## 648                    0.455  22       1
## 649                    0.260  42       1
## 650                    0.133  23       0
## 651                    0.234  23       0
## 652                    0.466  27       0
## 653                    0.269  28       0
## 654                    0.455  27       0
## 655                    0.142  22       0
## 656                    0.240  25       1
## 657                    0.155  22       0
## 658                    1.162  41       0
## 659                    0.190  51       0
## 660                    1.292  27       1
## 661                    0.182  54       0
## 662                    1.394  22       1
## 663                    0.165  43       1
## 664                    0.637  40       1
## 665                    0.245  40       1
## 666                    0.217  24       0
## 667                    0.235  70       1
## 668                    0.141  40       1
## 669                    0.430  43       0
## 670                    0.164  45       0
## 671                    0.631  49       0
## 672                    0.551  21       0
## 673                    0.285  47       0
## 674                    0.880  22       0
## 675                    0.587  68       0
## 676                    0.328  31       1
## 677                    0.230  53       1
## 678                    0.263  25       0
## 679                    0.127  25       1
## 680                    0.614  23       0
## 681                    0.332  22       0
## 682                    0.364  26       1
## 683                    0.366  22       0
## 684                    0.536  27       1
## 685                    0.640  69       0
## 686                    0.591  25       0
## 687                    0.314  22       0
## 688                    0.181  29       0
## 689                    0.828  23       0
## 690                    0.335  46       1
## 691                    0.856  34       0
## 692                    0.257  44       1
## 693                    0.886  23       0
## 694                    0.439  43       1
## 695                    0.191  25       0
## 696                    0.128  43       1
## 697                    0.268  31       1
## 698                    0.253  22       0
## 699                    0.598  28       0
## 700                    0.904  26       0
## 701                    0.483  26       0
## 702                    0.565  49       1
## 703                    0.905  52       1
## 704                    0.304  41       0
## 705                    0.118  27       0
## 706                    0.177  28       0
## 707                    0.261  30       1
## 708                    0.176  22       0
## 709                    0.148  45       1
## 710                    0.674  23       1
## 711                    0.295  24       0
## 712                    0.439  40       0
## 713                    0.441  38       1
## 714                    0.352  21       0
## 715                    0.121  32       0
## 716                    0.826  34       1
## 717                    0.970  31       1
## 718                    0.595  56       0
## 719                    0.415  24       0
## 720                    0.378  52       1
## 721                    0.317  34       0
## 722                    0.289  21       0
## 723                    0.349  42       1
## 724                    0.251  42       0
## 725                    0.265  45       0
## 726                    0.236  38       0
## 727                    0.496  25       0
## 728                    0.433  22       0
## 729                    0.326  22       0
## 730                    0.141  22       0
## 731                    0.323  34       1
## 732                    0.259  22       1
## 733                    0.646  24       1
## 734                    0.426  22       0
## 735                    0.560  53       0
## 736                    0.284  28       0
## 737                    0.515  21       0
## 738                    0.600  42       0
## 739                    0.453  21       0
## 740                    0.293  42       1
## 741                    0.785  48       1
## 742                    0.400  26       0
## 743                    0.219  22       0
## 744                    0.734  45       1
## 745                    1.174  39       0
## 746                    0.488  46       0
## 747                    0.358  27       1
## 748                    1.096  32       0
## 749                    0.408  36       1
## 750                    0.178  50       1
## 751                    1.182  22       1
## 752                    0.261  28       0
## 753                    0.223  25       0
## 754                    0.222  26       1
## 755                    0.443  45       1
## 756                    1.057  37       1
## 757                    0.391  39       0
## 758                    0.258  52       1
## 759                    0.197  26       0
## 760                    0.278  66       1
## 761                    0.766  22       0
## 762                    0.403  43       1
## 763                    0.142  33       0
## 764                    0.171  63       0
## 765                    0.340  27       0
## 766                    0.245  30       0
## 767                    0.349  47       1
## 768                    0.315  23       0

Handling Missing Values

Finally, the dataset was checked again for any remaining missing values. This step ensured that the data was complete and ready for exploratory analysis and modeling. Any missing values that persisted after imputation would have needed further investigation or handling, depending on the specific requirements of the analysis.

# Check for missing values
missing_values <- sapply(clean_data, function(x) sum(is.na(x)))
print(missing_values)
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0
# Glimpse data
glimpse(clean_data)
## Rows: 768
## Columns: 9
## $ Pregnancies              <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose                  <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure            <dbl> 72, 66, 64, 66, 40, 74, 50, 68, 70, 96, 92, 7…
## $ SkinThickness            <dbl> 35, 29, 28, 23, 35, 27, 32, 39, 45, 36, 38, 3…
## $ Insulin                  <dbl> 175, 55, 325, 94, 168, 112, 88, 122, 543, 150…
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age                      <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome                  <fct> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) was utilized to simplify the complexity of high-dimensional data while preserving its essential features. This section explains how PCA was applied to our dataset and interprets its results.

Understanding PCA

PCA transformed correlated variables into a set of linearly uncorrelated components, known as principal components (PCs). These components were ordered by the amount of variance they explained in the data, with the first PC explaining the maximum variance and each subsequent PC explaining less.
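
The two properties described above (components ordered by variance, and mutually uncorrelated) can be checked on a small example. The following is a minimal sketch on simulated data, not the diabetes dataset; the names toy, toy_pc, toy_var, and score_cor are illustrative assumptions.

```r
# Hypothetical toy data: columns a and b are strongly correlated, c is independent
set.seed(42)
x <- rnorm(200)
toy <- data.frame(a = x, b = x + rnorm(200, sd = 0.3), c = rnorm(200))

# Same call pattern as used in this report: center and scale, then rotate
toy_pc <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variance of each component; prcomp returns them in decreasing order
toy_var <- toy_pc$sdev^2
print(toy_var)

# Correlation matrix of the component scores; off-diagonal entries are ~0
score_cor <- cor(toy_pc$x)
print(round(score_cor, 10))
```

Because a and b carry nearly the same information, most of their shared variance should end up in the first component.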

Performing PCA

Initially, the dataset, including dummy variables for the outcome categories, was centered and scaled. This normalization step ensured that each variable contributed equally to the analysis, regardless of its original scale or units.
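
To illustrate what centering and scaling do, here is a minimal sketch on a hypothetical two-column data frame whose columns sit on very different scales. The column names and values are made up for illustration; prcomp(..., center = TRUE, scale. = TRUE) applies the same standardization internally.

```r
# Hypothetical values on very different scales (not taken from the dataset)
toy <- data.frame(glucose_like  = c(85, 120, 148, 183, 99),
                  pedigree_like = c(0.17, 0.35, 0.63, 0.67, 0.25))

# Subtract each column's mean and divide by its standard deviation
toy_scaled <- scale(toy, center = TRUE, scale = TRUE)

# After standardization every column has mean 0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Without this step, the column with the larger numeric range would dominate the first principal component simply because of its units.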

Explained Variance

The PCA results included a summary of the variance explained by each principal component. This information helped in understanding how much information each PC retained from the original dataset. It enabled us to decide how many principal components to retain based on the cumulative variance explained.

# Convert Outcome to dummy variables
clean_data$Outcome_0 <- ifelse(clean_data$Outcome == 0, 1, 0)
clean_data$Outcome_1 <- ifelse(clean_data$Outcome == 1, 1, 0)

# Remove the original Outcome column
clean_data <- clean_data[, !names(clean_data) %in% "Outcome"]

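# Perform PCA on the data, including the dummy variables.
# Note: Outcome_0 and Outcome_1 are exact complements, so the last
# component (PC10 in the summary) has essentially zero variance.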
pc <- prcomp(clean_data, center = TRUE, scale. = TRUE)

# Summary of the PCA results
summary(pc)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.8618 1.2280 1.1425 0.97142 0.95358 0.85786 0.6512
## Proportion of Variance 0.3466 0.1508 0.1305 0.09437 0.09093 0.07359 0.0424
## Cumulative Proportion  0.3466 0.4974 0.6280 0.72233 0.81326 0.88685 0.9293
##                            PC8     PC9      PC10
## Standard deviation     0.61864 0.56984 1.692e-16
## Proportion of Variance 0.03827 0.03247 0.000e+00
## Cumulative Proportion  0.96753 1.00000 1.000e+00
# Percentage of variance explained by each principal component
pc_var <- pc$sdev^2 / sum(pc$sdev^2) * 100

# Cumulative variance explained by principal components
pc_cumvar <- cumsum(pc_var)

# Plot the variance explained by each principal component
barplot(pc_var, main = "Variance Explained by Principal Components",
        xlab = "Principal Component", ylab = "Percentage of Variance Explained")

# Plot the cumulative variance explained by principal components
plot(pc_cumvar, type = "b", main = "Cumulative Variance Explained by Principal Components",
     xlab = "Number of Principal Components", ylab = "Cumulative Percentage of Variance Explained")

Cumulative Variance Plot Analysis

Number of Principal Components (PCs)

The x-axis of the plot represents the number of principal components included in the analysis. It starts from 1 (the first principal component) and extends to the total number of features in the dataset.

Cumulative Percentage of Variance Explained

The y-axis shows the cumulative percentage of variance explained by the selected principal components. As more principal components are added, this percentage increases, reflecting the total amount of variance captured by those components. This plot is crucial for deciding how many components are needed to explain a high percentage of the dataset’s variance, typically 90% or more.

Curve Behavior

The curve begins at the bottom left with the first principal component, which alone accounts for about 35% of the variance, and ascends rapidly at the start. This steep ascent indicates that the initial principal components account for a significant portion of the variance in the data. As more components are added, the curve gradually flattens out.

Summary

The cumulative explained variance plot in PCA guides the selection of the optimal number of principal components by showing how much variance is captured as components are added. It balances the trade-off between retaining enough information for analysis while avoiding the inclusion of redundant components that do not significantly contribute to explaining the dataset’s variance.
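
The selection rule can also be applied numerically. The sketch below assumes a 90% target and uses the proportions of variance printed by summary(pc) earlier in this report; the names prop_var, cum_var, threshold, and n_keep are illustrative.

```r
# Proportion of variance per component, copied from the summary(pc) output
prop_var <- c(0.3466, 0.1508, 0.1305, 0.09437, 0.09093,
              0.07359, 0.0424, 0.03827, 0.03247, 0)

# Running total of explained variance
cum_var <- cumsum(prop_var)

# Assumed target: keep enough components to explain at least 90%
threshold <- 0.90
n_keep <- which(cum_var >= threshold)[1]
print(n_keep)  # 7
```

With the live PCA object, the equivalent call would be which(pc_cumvar >= 90)[1], since pc_cumvar is stored as a percentage.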

Determining Optimal Components

Kaiser’s criterion was also used to determine the optimal number of principal components. It suggests retaining only those components with eigenvalues (the variance explained by each component) greater than one.

# Applying Kaiser's Criterion to PCA Results

# Calculate eigenvalues
eigenvalues <- pc$sdev^2

# Print the eigenvalues
print(eigenvalues)
##  [1] 3.466357e+00 1.508074e+00 1.305217e+00 9.436612e-01 9.093126e-01
##  [6] 7.359159e-01 4.240280e-01 3.827130e-01 3.247216e-01 2.864546e-32
# Filter eigenvalues greater than 1
optimal_components <- eigenvalues[eigenvalues > 1]

# Print the eigenvalues of principal components greater than 1
print(optimal_components)
## [1] 3.466357 1.508074 1.305217

Variable Loadings

Variable loadings are the coefficients that each original variable receives in a principal component. For standardized data, a loading is proportional to the correlation between the variable and the component, so it indicates the strength and direction of that variable’s contribution. Bar plots of variable loadings visualised which variables were most influential in each principal component.

# Extract variable loadings for PC1 to PC3 (the components retained by Kaiser's criterion)
variable_loadings <- as.data.frame(pc$rotation[, 1:3])

# Function to create bar plot for variable loadings
create_bar_plot <- function(pc_num) {
  bar_data <- data.frame(variable = rownames(variable_loadings),
                         loading = variable_loadings[, pc_num])
  ggplot(bar_data, aes(x = reorder(variable, loading), y = loading)) +
    geom_bar(stat = "identity", fill = "#0073C2FF") +
    coord_flip() +
    labs(title = paste("Variable Loadings for PC", pc_num),
         x = "Variable", y = "Loading")
}

# Create bar plots for variable loadings of PC1 to PC3
bar_plot_pc1 <- create_bar_plot(1)
bar_plot_pc2 <- create_bar_plot(2)
bar_plot_pc3 <- create_bar_plot(3)
#bar_plot_pc4 <- create_bar_plot(4)

# Display the plots
bar_plot_pc1

bar_plot_pc2

bar_plot_pc3

#bar_plot_pc4

PC 1 Variable Loadings Analysis

In PCA, variable loadings quantify the contribution of each original variable to the variance explained by each principal component. These loadings are crucial for understanding which variables are most influential in defining each PC.

  • Outcome_0: The variable “Outcome_0” shows the highest positive loading, indicating that it strongly influences PC1.
  • Outcome_1: The variable “Outcome_1” has the largest negative loading, suggesting a significant contribution to PC1 in the opposite direction.
  • Insulin, BMI, Skin Thickness, and Age: These variables also contribute inversely to PC1, albeit to a lesser extent, in descending order of influence.
  • Blood Pressure, Diabetes Pedigree Function, and Pregnancies: These variables exhibit very small negative loadings on PC1.

Summary: The sign and magnitude of loadings indicate the direction and strength of the relationship between each original variable and PC1. Higher absolute loading values signify greater influence on the variance explained by PC1. Positive loadings indicate a direct correlation, while negative loadings signify an inverse correlation with PC1.
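
The link between loadings and correlations can be verified directly. The sketch below uses R's built-in USArrests data as a stand-in for the diabetes data; the names demo_pc, var_pc_cor, and scaled_loadings are illustrative. It checks that, for standardized data, the correlation between a variable and a component equals the loading multiplied by that component's standard deviation.

```r
# USArrests ships with base R; it stands in for the diabetes data here
demo_pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings (eigenvector coefficients) as returned by prcomp
loadings <- demo_pc$rotation

# Correlation between each standardized variable and each component score
var_pc_cor <- cor(scale(USArrests), demo_pc$x)

# Multiply each loading column by its component's standard deviation
scaled_loadings <- sweep(loadings, 2, demo_pc$sdev, "*")

# The two matrices agree up to floating-point error
same <- isTRUE(all.equal(var_pc_cor, scaled_loadings, check.attributes = FALSE))
print(same)
```

This is why loadings on high-variance components such as PC1 translate into larger correlations than loadings of the same size on later components.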

PCA Visualisation

Visualising PCA results was crucial for interpreting the relationships between data points and understanding the distribution of variables across principal components. Scatter plots and variable representation plots (showing arrows indicating variable contributions) provided intuitive insights into how variables clustered and correlated in reduced-dimensional space.

# Extract PCA components for clustering
pca_data <- pc$x  

# Visualize PCA results (scatter plot)
pca_scatter <- fviz_pca_ind(pc, geom.ind = "point", 
                            pointshape = 21, palette = "jco", 
                            addEllipses = TRUE, ellipse.level = 0.95, 
                            repel = TRUE) +
              ggtitle("PCA Visualisation")

# Variable representation (arrows)
var_representation <- fviz_pca_var(pc, col.var = "contrib", 
                                    gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
                                    repel = TRUE, axes = c(1, 2), arrows = TRUE) + 
                      labs(title = "Variable Representation")

# Combine plots
combined_plot <- grid.arrange(pca_scatter, var_representation, ncol = 2)

# Save PCA scatter plot as standalone image
ggsave("pca_scatter_plot.png", pca_scatter, width = 8, height = 6)

# Save variable representation plot as standalone image
ggsave("variable_representation_plot.png", var_representation, width = 8, height = 6)

Determine optimal number of clusters

To determine the optimal number of clusters, statistical techniques such as the elbow method, the silhouette method, and the gap statistic were employed. These methods helped in selecting a number of clusters that captured the variability in the data while avoiding overfitting. Four clusters were chosen based on these methods, together with the average silhouette width computed after clustering.

Elbow Method

The Elbow method is a technique used in clustering algorithms, particularly K-means clustering, to determine the optimal number of clusters \(k\) in a dataset. It involves plotting the within-cluster sum of squares (WSS) against different values of \(k\). The “elbow” point on the plot represents the optimal \(k\) where the rate of decrease in WSS slows down, indicating that adding more clusters does not significantly improve the clustering performance.
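
For reference, the quantity plotted by the elbow method can be written as follows, where \(C_j\) denotes the set of points assigned to cluster \(j\) and \(\mu_j\) its centroid (notation introduced here for illustration). This is the value that kmeans reports as tot.withinss.

\[
W_k = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
\]

Since \(W_k\) can only decrease as \(k\) grows, the method looks for the point where the decrease levels off rather than for a minimum.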

# Elbow Method
elbow_method <- function(pca_data, max_k) {
  wss <- numeric(max_k)
  for (i in 1:max_k) {
    kmeans_model <- kmeans(pca_data, centers = i, nstart = 10)
    wss[i] <- kmeans_model$tot.withinss  # already a single total, no sum() needed
  }
  plot(1:max_k, wss, type = "b", xlab = "Number of Clusters (k)", ylab = "Total Within Sum of Squares (WSS)", main = "Elbow Method")
}

# Call elbow_method function
elbow_method(pca_data, max_k = 10)

Silhouette Method

The Silhouette Method is a technique used to determine the optimal number of clusters \(k\) in a dataset for clustering algorithms like K-means. It evaluates how similar each point in one cluster is to points in its own cluster compared to points in other clusters.
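The comparison described above is usually expressed per observation. In the sketch below, the symbols are assumptions made for illustration: \(a(i)\) is the mean distance from point \(i\) to the other points in its own cluster, and \(b(i)\) is the smallest mean distance from point \(i\) to the points of any other cluster.

```latex
\[
s(i) \;=\; \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
\]
```

The value plotted for each \(k\) is the average of \(s(i)\) over all observations, and the \(k\) with the highest average is preferred.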

# Silhouette Method
silhouette_method <- function(pca_data, max_k) {
  silhouette_scores <- numeric(max_k)
  for (i in 2:max_k) {
    kmeans_model <- kmeans(pca_data, centers = i, nstart = 10)
    silhouette_obj <- silhouette(kmeans_model$cluster, dist(pca_data))
    silhouette_scores[i] <- mean(silhouette_obj[, "sil_width"])
  }
  plot(2:max_k, silhouette_scores[2:max_k], type = "b", xlab = "Number of Clusters (k)", ylab = "Silhouette Score", main = "Silhouette Method")
}

# Call silhouette_method function
silhouette_method(pca_data, max_k = 10)

Gap Statistics

The gap statistic is a method used to determine the optimal number of clusters \(k\) in a dataset for clustering algorithms such as K-means. It compares the within-cluster variation (sum of squares) of the clustering algorithm’s output with the variation expected under a reference null distribution that represents data with no obvious clustering structure.

gap_statistics <- function(pca_data, max_k, B = 10) {
  gap <- numeric(max_k - 1)  # Initialize with length max_k - 1
  for (i in 2:max_k) {  # Start from 2 instead of 1
    print(paste("Calculating gap statistic for k =", i))
    # clusGap() evaluates k = 1..K.max itself, so no separate kmeans() fit is needed
    gap_result <- clusGap(pca_data, FUNcluster = kmeans, K.max = i, B = B)
    gap[i - 1] <- gap_result$Tab[i, "gap"]  # Gap value for k = i (not the running maximum)
  }
  plot(2:max_k, gap, type = "b", xlab = "Number of Clusters (k)", ylab = "Gap Statistic", main = "Gap Statistics")
}

# Call gap_statistics function
gap_statistics(pca_data, max_k = 10)
## [1] "Calculating gap statistic for k = 2"
## [1] "Calculating gap statistic for k = 3"
## [1] "Calculating gap statistic for k = 4"
## [1] "Calculating gap statistic for k = 5"
## [1] "Calculating gap statistic for k = 6"
## [1] "Calculating gap statistic for k = 7"
## [1] "Calculating gap statistic for k = 8"
## [1] "Calculating gap statistic for k = 9"
## [1] "Calculating gap statistic for k = 10"
## Warning: did not converge in 10 iterations
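A single call to `clusGap()` already evaluates every \(k\) from 1 to `K.max`, so the whole gap curve can be obtained without a loop. The following is a minimal sketch under stated assumptions: `toy_scores` is synthetic data standing in for the PC scores, and the choice of `B = 20` and of the `"Tibs2001SEmax"` selection rule are illustrative, not the settings used in this report.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(123)

# Synthetic stand-in for the PC scores: three separated 2-D groups (assumption)
toy_scores <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# One call covers k = 1..K.max; nstart is passed through to kmeans()
gap_result <- clusGap(toy_scores, FUNcluster = kmeans, K.max = 6,
                      B = 20, nstart = 10, verbose = FALSE)

# One row per k, with columns logW, E.logW, gap and SE.sim
gap_table <- gap_result$Tab

# Pick k with the standard-error rule of Tibshirani et al. (2001)
best_k <- maxSE(gap_table[, "gap"], gap_table[, "SE.sim"],
                method = "Tibs2001SEmax")
print(best_k)
```

Raising `nstart` and the iteration limit of `kmeans()` (for example `iter.max = 50`, also passed through `...`) is one way to reduce non-convergence warnings such as the one shown above.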

K-means Clustering with K = 4

K-Means Clustering

K-means clustering is a method used to partition a dataset into distinct groups (clusters) based on similarity. By minimising the variance within each cluster, K-means aims to create groups where the data points within each group are more similar to each other than to those in other groups. This analysis applied K-means clustering to the results of Principal Component Analysis (PCA) to understand the underlying patterns in the data.

Clustering with K=4

To explore the dataset’s structure, we performed K-means clustering using the first three principal components (PC1 to PC3). We set the number of clusters (K) to 4. The PCA results provided a reduced dimensionality space, making it easier to visualise and interpret the clusters.

Cluster Visualization

The clusters were visualised in two-dimensional and three-dimensional plots. The 2D plot displayed the data points along PC1 and PC2, colored according to their cluster assignments. A 3D scatter plot further illustrated the clustering in the PCA-reduced space, showing the separation and distribution of clusters across PC1, PC2, and PC3.

# Extract PC scores for PC1 to PC3
pc_scores <- as.data.frame(pc$x[, 1:3])

# Perform K-means clustering with K = 4 using PCA results
set.seed(123)  # Setting seed for reproducibility
kmeans_result <- kmeans(pc_scores, centers = 4, nstart = 10)

# Visualize the clusters
ggplot(pc_scores, aes(x = PC1, y = PC2, color = factor(kmeans_result$cluster))) +
  geom_point() +
  scale_color_discrete(name = "Cluster") +
  labs(title = "K-means Clustering Results (K = 4)", x = "Principal Component 1", y = "Principal Component 2")

Observations from Visualisation (scatter plot)

  • Cluster Distribution: The visualisations revealed distinct groupings, indicating that the K-means algorithm identified separate clusters within the data.
  • Cluster Separation: Clusters exhibited overlapping boundaries, suggesting some similarity between neighbouring groups.

# Extract PC scores for PC1 to PC3
pc_scores <- as.data.frame(pc$x[, 1:3])

# Perform K-means clustering with K = 4 using PCA results
set.seed(123)  # Setting seed for reproducibility
kmeans_result <- kmeans(pc_scores, centers = 4, nstart = 10)

# Visualize the clusters using fviz
fviz_cluster(kmeans_result, data = pc_scores, geom = "point", 
             ellipse.type = "convex", 
             palette = "jco", 
             ggtheme = theme_minimal(),
             main = "K-means Clustering Results (K = 4)")

Observations from Visualisation (fviz_cluster)

  • Cluster Sizes: The clusters varied in size, reflecting the dataset’s inherent structure and the algorithm’s ability to group similar observations.
  • Dissimilarity: Clusters showed varying degrees of internal dissimilarity, with some clusters being more homogeneous than others.

3D Visualization of clusters

# Visualize the clusters in 3D
plot_ly(data = pc_scores, x = ~PC1, y = ~PC2, z = ~PC3, color = factor(kmeans_result$cluster), 
        type = "scatter3d", mode = "markers", marker = list(size = 6)) %>%
  layout(title = "K-means Clustering Results (K = 4)",
         scene = list(xaxis = list(title = "Principal Component 1"),
                      yaxis = list(title = "Principal Component 2"),
                      zaxis = list(title = "Principal Component 3")))

Cluster Statistics and Dissimilarity Calculation

Cluster Statistics: To gain deeper insights into the clusters, we computed various statistics:

  • Number of Observations: Each cluster’s size was calculated to understand the distribution of data points across clusters.
  • Dissimilarity Measures: Maximum and average pairwise dissimilarities were evaluated with daisy() from the cluster package. For all-numeric input such as the PC scores, daisy() uses Euclidean distance by default; Gower distance is applied only when the data contain mixed types.
  • Isolation: Isolation was assessed by measuring the minimum distance between cluster centers, indicating how distinct the clusters are from one another.

# Perform K-means clustering with K=4 using PCA results
set.seed(123)  # For reproducibility
kmeans_result <- kmeans(pc_scores[, 1:3], centers = 4, nstart = 10)

# Add cluster labels to the PCA scores
pc_scores_with_clusters <- cbind(pc_scores, cluster = kmeans_result$cluster)

# Function to compute cluster statistics
compute_cluster_stats <- function(cluster_data, cluster_centers) {
  cluster_stats <- cluster_data %>%
    group_by(cluster) %>%
    summarise(
      number_obs = n(),
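      # NOTE: cluster_data is the full data frame, not the current group, so the
      # two values below are dataset-wide and identical for every cluster
      max_dissimilarity = max(daisy(cluster_data[, 1:3])),
      average_dissimilarity = mean(daisy(cluster_data[, 1:3]))
    )
  
  # NOTE: rbind() puts all centers back together, so the minimum is taken over
  # every pair of centers and is the same for each i
  isolation <- sapply(1:nrow(cluster_centers), function(i) {
    min_dist <- min(dist(rbind(cluster_centers[i, ], cluster_centers[-i, ])))
    return(min_dist)
  })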
  
  cluster_stats$isolation <- isolation
  return(cluster_stats)
}

# Compute dissimilarities and cluster statistics
dissimilarities <- daisy(pc_scores_with_clusters[, 1:3])
cluster_centers <- kmeans_result$centers
cluster_stats <- compute_cluster_stats(pc_scores_with_clusters, cluster_centers)

# Print the cluster statistics
print(cluster_stats)
## # A tibble: 4 × 5
##   cluster number_obs max_dissimilarity average_dissimilarity isolation
##     <int>      <int>             <dbl>                 <dbl>     <dbl>
## 1       1        219              10.5                  3.23      2.40
## 2       2        266              10.5                  3.23      2.40
## 3       3        127              10.5                  3.23      2.40
## 4       4        156              10.5                  3.23      2.40
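Because the dissimilarity and isolation columns above are identical across clusters, a variant that restricts each calculation to one cluster at a time may be more informative. The following is a minimal sketch, not the method used in this report: `toy_scores`, `toy_fit`, and the helper `per_cluster_stats()` are hypothetical names, and the data are synthetic.

```r
library(cluster)  # provides daisy()

set.seed(123)

# Synthetic stand-in for the first three PC scores (assumption)
toy_scores <- data.frame(PC1 = rnorm(90), PC2 = rnorm(90), PC3 = rnorm(90))
toy_fit <- kmeans(toy_scores, centers = 3, nstart = 10)

# Hypothetical helper: statistics computed separately for each cluster
per_cluster_stats <- function(scores, labels, centers) {
  # Pairwise distances between centers; ignore each center's distance to itself
  center_dist <- as.matrix(dist(centers))
  diag(center_dist) <- NA

  do.call(rbind, lapply(sort(unique(labels)), function(k) {
    # Dissimilarities among the rows of cluster k only
    d <- daisy(scores[labels == k, , drop = FALSE])
    data.frame(
      cluster = k,
      number_obs = sum(labels == k),
      max_dissimilarity = max(d),
      average_dissimilarity = mean(d),
      # Distance from this cluster's center to the nearest other center
      isolation = min(center_dist[k, ], na.rm = TRUE)
    )
  }))
}

toy_stats <- per_cluster_stats(toy_scores, toy_fit$cluster, toy_fit$centers)
print(toy_stats)
```

With this layout, each row describes one cluster, so the dissimilarity and isolation columns can differ between clusters.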

K-Means Clustering Statistical Analysis

Key Findings

  1. Cluster Sizes: The number of observations in each cluster varied, with Cluster 2 being the largest (266 observations) and Cluster 3 being the smallest (127 observations). This variation in cluster sizes indicates a non-uniform distribution of data points across the clusters.

  2. Maximum Dissimilarity: The reported maximum dissimilarity is 10.51 for every cluster. The values are identical because daisy() was evaluated on the full set of PC scores rather than on each cluster’s rows, so this figure is the largest pairwise distance in the whole dataset and not a per-cluster measure of internal variability.

  3. Average Dissimilarity: The reported average dissimilarity is 3.23 for every cluster, for the same reason. It is the typical distance between any two observations in the dataset and should not be read as evidence that the clusters have similar internal cohesion.

  4. Isolation: The isolation value was also identical for all clusters at 2.40. Because the minimum was taken over every pair of cluster centers, this value is the distance between the two closest centers in the solution and does not show how isolated each individual cluster is.

Evaluation of K-Means Clustering

Evaluating the performance and validity of the K-means clustering algorithm is essential to ensure that the clusters formed are meaningful and distinct. This section presents the evaluation of the K-means clustering performed on the dataset using several key metrics.

Evaluation Metrics

  1. Calinski-Harabasz Index: The Calinski-Harabasz index, also known as the Variance Ratio Criterion, measures the ratio of the sum of between-cluster dispersion to the sum of within-cluster dispersion. Higher values of the Calinski-Harabasz index indicate better-defined and more distinct clusters. For the given clustering solution, the Calinski-Harabasz index was computed, providing a quantitative assessment of cluster separation and compactness.

  2. Dunn Index: The Dunn index is another metric used to evaluate the clustering quality by considering both the minimum inter-cluster distance and the maximum intra-cluster distance. A higher Dunn index indicates better clustering, as it signifies well-separated and compact clusters. The Dunn index was calculated for the clustering solution; the resulting value of about 0.04 is low, which is consistent with the overlapping cluster boundaries seen in the plots.

  3. Silhouette Coefficient: The silhouette coefficient measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, where a value close to 1 indicates that the data point is well-matched to its own cluster and poorly matched to neighboring clusters. The silhouette coefficients for each data point were calculated, and a silhouette plot was generated to visually assess the clustering quality. The plot highlighted the cohesion within clusters and the separation between different clusters.
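The first two metrics can be stated compactly. The notation below is introduced for illustration only: \(n\) is the number of observations, \(k\) the number of clusters, \(B_k\) and \(W_k\) the between-cluster and within-cluster sums of squares, \(\delta(C_i, C_j)\) the smallest distance between a point in \(C_i\) and a point in \(C_j\), and \(\Delta(C_l)\) the largest distance between two points in \(C_l\).

```latex
\[
\mathrm{CH}(k) \;=\; \frac{B_k / (k - 1)}{W_k / (n - k)},
\qquad
\mathrm{Dunn} \;=\; \frac{\min_{i \neq j} \, \delta(C_i, C_j)}{\max_{l} \, \Delta(C_l)}
\]
```

Both indices grow when clusters are compact and far apart. Because the Dunn index depends on a single closest pair and a single widest cluster, it is sensitive to outliers.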

# Assuming clusters contains the cluster assignments from clustering algorithm

# Compute the distance matrix
dist_matrix <- dist(pca_data)

# From kmeans clustering
clusters <- kmeans_result$cluster

# Compute the validation statistics once and reuse them
cluster_validation <- fpc::cluster.stats(dist_matrix, clusters)

# Calinski-Harabasz Index
calinski_harabasz <- cluster_validation$ch

# Dunn Index
dunn_index <- cluster_validation$dunn

# Print Evaluation Metrics
print(paste("Calinski-Harabasz Index:", calinski_harabasz))
## [1] "Calinski-Harabasz Index: 180.344150775887"
print(paste("Dunn Index:", dunn_index))
## [1] "Dunn Index: 0.0408701186773627"
class(dist_matrix)  # Should be 'dist'
## [1] "dist"
class(clusters)     # Should be 'integer' or 'factor'
## [1] "integer"
dim(dist_matrix)
## [1] 768 768
length(clusters)
## [1] 768
# Function to calculate Silhouette Coefficient
calculate_silhouette <- function(data, clusters) {
  library(cluster)
  # Use the data passed in rather than the global pca_data
  sil <- silhouette(clusters, dist(data))
  return(sil)
}


# Calculate silhouette coefficients
sil_scores <- calculate_silhouette(pca_data, clusters)

# Plot silhouette plot
fviz_silhouette(sil_scores, palette = "jco", main = "Silhouette Plot for K-means Clustering")
##   cluster size ave.sil.width
## 1       1  219          0.27
## 2       2  266          0.16
## 3       3  127          0.14
## 4       4  156          0.16
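The per-cluster widths above can be complemented by an overall average and by the share of observations with a negative width, which are points that sit closer to a neighbouring cluster than to their own. The following is a minimal sketch on synthetic data. The names `toy_scores`, `toy_fit`, and `toy_sil` are hypothetical and the numbers it produces are unrelated to the results in this report.

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Synthetic stand-in for the PC scores: three 2-D groups (assumption)
toy_scores <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.5), ncol = 2)
)
toy_fit <- kmeans(toy_scores, centers = 3, nstart = 10)

# Silhouette widths for every observation
toy_sil <- silhouette(toy_fit$cluster, dist(toy_scores))
sil_summary <- summary(toy_sil)

overall_width <- sil_summary$avg.width            # average over all points
per_cluster_width <- sil_summary$clus.avg.widths  # average within each cluster
share_negative <- mean(toy_sil[, "sil_width"] < 0)  # possibly misassigned points

print(overall_width)
print(per_cluster_width)
print(share_negative)
```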

Cluster Profiles

To further understand the characteristics of each cluster, the mean values of the original variables were computed for each cluster. This analysis provided insights into the distinguishing features of each cluster, revealing how different variables contributed to the clustering results. A bar plot was created to visualise the cluster profiles, showing the mean values of the variables for each cluster and highlighting the differences between them.

# Perform K-means clustering with K = 4
set.seed(123)  # Set seed for reproducibility
kmeans_result <- kmeans(pca_data, centers = 4, nstart = 25)
clusters <- kmeans_result$cluster

# Assign cluster labels to clean_data
clean_data_with_clusters <- cbind(clean_data, Cluster = kmeans_result$cluster)

# Calculate mean of original variables by cluster
cluster_means <- aggregate(. ~ Cluster, data = clean_data_with_clusters, FUN = mean)

# Reshape data for plotting (assuming clean_data has appropriate column names)
cluster_means_long <- pivot_longer(cluster_means, cols = -Cluster, names_to = "Variable", values_to = "Mean")

# Create bar plot
ggplot(cluster_means_long, aes(x = Variable, y = Mean, fill = factor(Cluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Cluster Profiles (Clean Data)",
       x = "Variable",
       y = "Mean",
       fill = "Cluster") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Plotly Bar Chart

# Create interactive plotly bar plot
plot_ly(cluster_means_long, x = ~Variable, y = ~Mean, color = ~factor(Cluster), type = "bar") %>%
  layout(title = "Cluster Profiles",
         xaxis = list(title = "Variable"),
         yaxis = list(title = "Mean"),
         barmode = "group")

Comparison and Description of Clusters

Description of Clusters

Cluster 1 - Low Risk Young Adults

Age: The average age in Cluster 1 is 26.42 years.
Outcome: Only 9.9% of individuals in this cluster have diabetes (Outcome_1 = 0.099), while 90% do not (Outcome_0 = 0.90).
Glucose: The mean glucose level is 105.89 mg/dL, which is relatively low compared to the other clusters.
Insulin: Average insulin level is 96.2 µU/mL which is relatively normal.
BMI: The mean BMI is 26.75, indicating that individuals in this cluster are slightly overweight according to the WHO classification.
Blood Pressure: The average blood pressure is 65.09 mmHg, which is the lowest among all clusters.
Diabetes Pedigree Function: The mean value is 0.428, suggesting a mild genetic predisposition to diabetes.
Pregnancies: On average, individuals have 2.29 pregnancies.
Skin Thickness: The mean skin thickness is 21.079 mm, indicating thinner skinfold measurements compared to other clusters.

Cluster 2 - High Risk Middle-Aged Adults

Age: The average age in Cluster 2 is 35.05 years.
Outcome: A significant 76.5% of individuals in this cluster have diabetes (Outcome_1 = 0.765), while only 23.4% do not (Outcome_0 = 0.234).
Glucose: The mean glucose level is 164.67 mg/dL, the highest among all clusters.
Insulin: Average insulin level is 295 µU/mL, which may indicate insulin resistance or insulin therapy.
BMI: The mean BMI is 37, categorizing individuals in this cluster as obese.
Blood Pressure: The average blood pressure is 73.5 mmHg.
Diabetes Pedigree Function: The mean value is 0.693, indicating a higher genetic predisposition to diabetes.
Pregnancies: On average, individuals have 3.86 pregnancies.
Skin Thickness: The mean skin thickness is 34.4 mm, indicating thicker skinfold measurements.

Cluster 3 - Moderate Risk Young Adults

Age: The average age in Cluster 3 is 28 years.
Outcome: 33.1% of individuals in this cluster have diabetes (Outcome_1 = 0.331), while 69% do not (Outcome_0 = 0.69).
Glucose: The mean glucose level is 113 mg/dL.
Insulin: Average insulin level is 128 µU/mL.
BMI: The mean BMI is 37.46, categorising individuals in this cluster as obese (WHO class II, BMI 35 to 39.9), just below the class III threshold of 40.
Blood Pressure: The average blood pressure is 74.1 mmHg.
Diabetes Pedigree Function: The mean value is 0.45, indicating a moderate genetic predisposition to diabetes.
Pregnancies: On average, individuals have 2.06 pregnancies.
Skin Thickness: The mean skin thickness is 35.4 mm, the highest among all clusters.

Cluster 4 - Moderate Risk Older Adults

Age: The average age in Cluster 4 is 47 years.
Outcome: 44.9% of individuals in this cluster have diabetes (Outcome_1 = 0.449), while 55% do not (Outcome_0 = 0.55).
Glucose: The mean glucose level is 127 mg/dL.
Insulin: Average insulin level is 141 µU/mL.
BMI: The mean BMI is 32, categorising individuals in this cluster as obese.
Blood Pressure: The average blood pressure is 80 mmHg, the highest among all clusters.
Diabetes Pedigree Function: The mean value is 0.42, indicating a mild genetic predisposition to diabetes, comparable to Cluster 1.
Pregnancies: On average, individuals have 8 pregnancies, the highest among all clusters.
Skin Thickness: The mean skin thickness is 30.42 mm.

Comparison of Clusters:

  • Age: Cluster 4 has the oldest average age (47 years), while Cluster 1 has the youngest (26.42 years).
  • Outcome (Diabetes Prevalence): Cluster 2 has the highest prevalence of diabetes (76.5%), followed by Cluster 4 (44.9%), Cluster 3 (33.1%), and Cluster 1 (9.9%).
  • Glucose Levels: Cluster 2 has the highest average glucose level (164.67 mg/dL), indicating a higher risk or presence of diabetes.
  • Insulin Levels: Cluster 2 also has the highest insulin levels (295 µU/mL), followed by Cluster 4 (141 µU/mL), Cluster 3 (128 µU/mL), and Cluster 1 (96.2 µU/mL).
  • BMI: Clusters 2 and 3 have similarly high BMI values (37 and 37.46, respectively), indicating obesity, while Cluster 1 has the lowest BMI (26.75).
  • Blood Pressure: Cluster 4 has the highest blood pressure (80 mmHg), whereas Cluster 1 has the lowest (65.09 mmHg).
  • Diabetes Pedigree Function: Cluster 2 has the highest genetic predisposition to diabetes (0.693), while Clusters 1 and 4 have the lowest (0.428 and 0.42, respectively).
  • Pregnancies: Cluster 4 has the highest average number of pregnancies (8), which is consistent with the higher average age in this cluster and may be related to its diabetes prevalence.
  • Skin Thickness: Cluster 3 has the highest skin thickness (35.4 mm), while Cluster 1 has the lowest (21.079 mm).

Strategic Interventions for Each Cluster

Cluster 1: Low Risk Young Adults

Intervention Focus: Prevention and Education

For Cluster 1, comprising young adults with a low risk of diabetes, strategic interventions should focus on prevention and education. Because the mean BMI in this cluster is already slightly above the normal range, promoting a healthy lifestyle through balanced eating habits and regular physical activity is essential to prevent further weight gain and to maintain low glucose levels. Educational campaigns on diabetes prevention, specifically targeting young adults, can reinforce the importance of these habits. Additionally, advocating for routine health screenings to monitor vital signs such as glucose and insulin levels can help in early detection and prevention of diabetes.

Cluster 2: High Risk Middle-Aged Adults

Intervention Focus: Intensive Management and Support

Cluster 2 consists of middle-aged adults at high risk for diabetes, necessitating intensive management and support. Medical management, including insulin therapy and medications, is crucial to control high glucose and insulin levels. Specialized weight management programs should be implemented to address obesity and reduce related health risks. Given the high genetic predisposition to diabetes in this cluster, genetic counseling can provide valuable insights and management strategies. Regular health check-ups and increased monitoring frequency are imperative for early intervention and effective management of potential complications.

Cluster 3: Moderate Risk Young Adults

Intervention Focus: Risk Reduction and Monitoring

Young adults in Cluster 3, who face a moderate risk of developing diabetes, require interventions aimed at risk reduction and monitoring. Targeted diet and exercise programs can help address obesity and manage glucose levels. Regular health monitoring of glucose and insulin levels is essential to manage and reduce the risk of diabetes. Establishing support groups can provide necessary lifestyle modification guidance and peer support to encourage healthy habits. Preventive healthcare services, including routine screenings and early detection strategies, should be emphasized to mitigate the risk of diabetes.

Cluster 4: Moderate Risk Older Adults

Intervention Focus: Comprehensive Care and Lifestyle Adjustments

For older adults in Cluster 4 with a moderate risk of diabetes, a comprehensive care approach combining medical treatment and lifestyle adjustments is recommended. Chronic disease management programs tailored to older adults should focus on controlling BMI, blood pressure, and glucose levels. Encouraging participation in age-appropriate physical activity programs can improve overall health. These integrated care strategies aim to manage moderate risk factors effectively and improve health outcomes.

Conclusion

This report analysed the characteristics of four distinct clusters within a dataset, each representing different levels of diabetes risk. Cluster 1, comprising young adults with low diabetes risk, benefits from preventive education and lifestyle maintenance. Cluster 2, with high-risk middle-aged adults, requires intensive management and support, including medical treatments and weight management. Cluster 3, featuring young adults with moderate risk, should focus on risk reduction and continuous monitoring through targeted diet and exercise programs. Cluster 4, consisting of older adults with moderate risk, demands comprehensive care and lifestyle adjustments tailored to their specific health needs.

By identifying and understanding these clusters, we can implement strategic interventions that are tailored to the unique needs of each group. This targeted approach enhances the effectiveness of diabetes prevention and management efforts, ultimately leading to improved health outcomes and a reduction in the prevalence of diabetes.